The Auto Insurance Claims Dataset was provided by Professor Paul Tallon as a CSV file. This dataset contains information about claims made to an insurance company that provides Personal, Special, and Corporate auto coverage plans across the states of Iowa, Kansas, Missouri, Nebraska, and Oklahoma.
The Auto Insurance Claims Dataset includes a total of 26 columns and 9,134 rows.
The following table contains information about each column.
| Column Name | Type |
|---|---|
| Coustomer | string |
| Country | string |
| State Code | string (Categorical) |
| State | string (Categorical) |
| Claim Amount | float |
| Response | string (Categorical) |
| Coverage | string (Categorical) |
| Education | string (Categorical) |
| Effective To Date | Date |
| EmploymentStatus | string (Categorical) |
| Gender | string (Categorical) |
| Income | float |
| Location Code | string (Categorical) |
| Marital Status | string (Categorical) |
| Monthly Premium Auto | float |
| Months Since Last Claim | int |
| Months Since Policy Inception | int |
| Number of Open Complaints | int |
| Number of Policies | int |
| Policy Type | string (Categorical) |
| Policy | string (Categorical) |
| Claim Reason | string (Categorical) |
| Sales Channel | string (Categorical) |
| Total Claim Amount | float |
| Vehicle Class | string (Categorical) |
| Vehicle Size | string (Categorical) |
None of the rows include NA or null values.
For the columns of type int or float, summary statistics can be found:
| Statistic | Value |
|---|---|
| Minimum | 189.80 |
| 1st Q | 399.42 |
| Mean | 800.49 |
| Median | 578.01 |
| 3rd Q | 896.21 |
| Maximum | 8,332.53 |
| Statistic | Value |
|---|---|
| Minimum | 0.00 |
| 1st Q | 0.00 |
| Mean | 37,657.38 |
| Median | 33,889.50 |
| 3rd Q | 62,320.00 |
| Maximum | 99,981.00 |
| Statistic | Value |
|---|---|
| Minimum | 61.00 |
| 1st Q | 68.00 |
| Mean | 93.21 |
| Median | 83.00 |
| 3rd Q | 109.00 |
| Maximum | 298.00 |
| Statistic | Value |
|---|---|
| Minimum | 61.00 |
| 1st Q | 68.00 |
| Mean | 93.21 |
| Median | 83.00 |
| 3rd Q | 109.00 |
| Maximum | 298.00 |
| Statistic | Value |
|---|---|
| Minimum | 0 |
| 1st Q | 6 |
| Mean | 15.09 |
| Median | 14 |
| 3rd Q | 23 |
| Maximum | 35 |
| Statistic | Value |
|---|---|
| Minimum | 0 |
| 1st Q | 24 |
| Mean | 48.06 |
| Median | 48 |
| 3rd Q | 71 |
| Maximum | 99 |
| Statistic | Value |
|---|---|
| Minimum | 0 |
| 1st Q | 0 |
| Mean | 0.38 |
| Median | 0 |
| 3rd Q | 0 |
| Maximum | 5 |
| Statistic | Value |
|---|---|
| Minimum | 1 |
| 1st Q | 1 |
| Mean | 2.97 |
| Median | 2 |
| 3rd Q | 4 |
| Maximum | 9 |
| Statistic | Value |
|---|---|
| Minimum | 0.10 |
| 1st Q | 272.26 |
| Mean | 434.09 |
| Median | 383.95 |
| 3rd Q | 547.51 |
| Maximum | 2,893.24 |
From these numerical summaries, it does not appear that any of the numerical (int or float) columns included in the dataset have abnormal distributions. Some columns, such as Number of Policies and Total Claim Amount, seem to be skewed, but that should not be cause for concern.
There are many relationships to be discovered in this dataset. The
following
visualizations explore insights that may help future business
operations.
The scatterplot below depicts the number of open complaints broken up by each state and the different claim reasons. We can see that Collision is the Claim Reason that has the highest number of open claims across all states, while Other has the least across all states. We also see that Iowa and Missouri have the highest number of open complaints across the different claim reasons. When looking at particular pairings, it is evident that there are most open claims in Missouri for Collision, followed by open claims in Iowa for collisions. It appears that the (Kansas, Scratch/Dent), (Kansas, Other), and (Oklahoma, Other) pairings have the lowest number of open complaints. This information could help the auto insurance company to know what states and claim reasons are most common and to adjust their plan coverage and pricing accordingly. Further, it could help the company figure out what kinds of claims and which states need the most attention.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read csv
df = pd.read_csv("./Auto_Insurance_Claims_Sample.csv")
# filter df to include only specific columns
scatter_df = df.groupby(["State", "Claim Reason"])["Number of Open Complaints"].sum().reset_index(name="sum")
# reorder claims
claim_order = ["Other", "Scratch/Dent", "Hail", "Collision"]
scatter_df['Claim Reason'] = pd.Categorical(scatter_df['Claim Reason'], categories=claim_order, ordered=True)
scatter_df = scatter_df.sort_values(['State', 'Claim Reason'])
# create scatterplot
plt.figure(figsize=(18, 8))
plt.scatter(scatter_df["State"], scatter_df["Claim Reason"], marker='8', cmap='cool',
c=scatter_df["sum"], s=scatter_df["sum"] * 5, edgecolors="black")
plt.title("Number of Open Complaints by State and Claim Reason", fontsize=18)
plt.xlabel("State", fontsize=14)
plt.ylabel("Claim Reason", fontsize=14)
cbar = plt.colorbar()
cbar.set_label("Number of Open Compaints", rotation=270, fontsize=14, color="black", labelpad=30)
my_colorbar_ticks = [*range(int(scatter_df["sum"].min()), int(scatter_df["sum"].max()), 100)]
cbar.set_ticks(my_colorbar_ticks)
cbar.set_ticklabels(my_colorbar_ticks)
plt.margins(x=0.1, y=0.1)
xticks = plt.xticks(scatter_df["State"])
yticks = plt.yticks(scatter_df["Claim Reason"])
plt.show()
The bar chart below depicts the average claim amount by each class of vehicle. These averages are also compared to the overall average claim amount. It is evident that the “luxury” vehicle classes are both above the overall average, while the remaining vehicle classes fall below. It seems that the Luxury SUV and the Luxury Car classes have similar average claim amounts, showing that the SUV vs Car status does not matter as much as the “Luxury” title. Among the remaining vehicle classes, we see that the Sports Car group has the highest in average claim amount, followed by SUV, Two-Door Car, and Four-Door Car. This makes sense because we would expect Sports Cars to be more expensive than SUVs, which would be more expensive than Cars. Like the two luxury classes, we can see that there is not much difference in average claim amount for two-door and four-door cars. This information could be helpful to the auto insurance company so they know how to price their insurance policies for the different vehicle classes.
import matplotlib.patches as mpatches
def pick_colors_according_to_mean_count(this_data):
"""
make a function to take df and assign color to row
"""
colors = []
avg = this_data["Claim Amount"].mean()
for each in this_data["Claim Amount"]:
if each > avg * 1.01:
colors.append("palevioletred")
elif each < avg * 0.99:
colors.append("forestgreen")
else:
colors.append("silver")
return colors
# filter df
bar_df = df.groupby(["Vehicle Class"]).agg({"Claim Amount": "mean"}).reset_index()
bar_df = bar_df.sort_values("Claim Amount", ascending=True)
# get colors
my_colors = pick_colors_according_to_mean_count(bar_df)
# create boundaries
above = mpatches.Patch(color="palevioletred", label="Above Average")
at = mpatches.Patch(color="silver", label="Within 1% of the Average")
below = mpatches.Patch(color="forestgreen", label="Below Average")
# set up plot
fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(bar_df["Vehicle Class"], bar_df["Claim Amount"], color=my_colors) # horizontal bars is barh
# add labels to the end of each bar
for row_counter, value_at_row_counter in enumerate(bar_df["Claim Amount"]):
if value_at_row_counter > bar_df["Claim Amount"].mean() * 1.01:
color = "palevioletred"
elif value_at_row_counter < bar_df["Claim Amount"].mean() * 0.99:
color = "forestgreen"
else:
color = "silver"
ax1.text(value_at_row_counter + 10, row_counter, str(f"{value_at_row_counter:,.2f}"), color=color,
size=12, fontweight="bold", ha="left", va="center", backgroundcolor="white")
xlim = plt.xlim(0, bar_df["Claim Amount"].max() * 1.1)
ax1.legend(loc="lower right", handles=[above, at, below], fontsize=14)
plt.axvline(bar_df["Claim Amount"].mean(), color="black", linestyle="dashed") # vertical reference line
ax1.text(bar_df["Claim Amount"].mean() + 2, 0, str(f"Mean = {bar_df['Claim Amount'].mean():,.2f}"), rotation=0, fontsize=14)
ax1.set_title("Average Claim Amount by Vehicle Class", size=20)
ax1.set_xlabel("Claim Amount", fontsize=16)
ax1.set_ylabel("Vehicle Class", fontsize=16)
ax1.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{x:,.0f}"))
xticks = plt.xticks(fontsize=14)
yticks = plt.yticks(fontsize=14)
plt.show()
The waterfall diagram below depicts the deviation between the actual average number of policies and the average number of policies by months since policy inception bins. We see that there is a larger increase in policies from the 30-39 months since inception group to the 40-49 months since inception group. Further, there are decreases in the average number of policies from the 40-49 months since inception group to the 50-59 months since inception group, as well as the 80-89 months since inception group to the 90-99 months since inception group. It would be important to analyze the information that this diagram gives to see what may be the cause for these changes, as well as to see if the differences are statistically significant or not. If these results were statistically significant, this information could possibly inform the marketing team to target those in 40-49 range to see if they can sell more policies to those who may have only 1 current policy.
import os
import math
import plotly.graph_objects as go
import plotly.io as pio
# Sort values and reset index
wf_df = df[["Number of Policies", "Months Since Policy Inception"]].sort_values(by="Months Since Policy Inception")
wf_df.reset_index(inplace=True, drop=True)
# Define bins for Policy Inception grouping
bins = np.arange(0, 101, 10) # Bins: [0, 10, ..., 100]
labels = [f"{i}-{i+9}" for i in bins[:-1]] # Labels: "0-9", "10-19", ...
# Create a new column for binned groups
wf_df["Policy Inception Bin"] = pd.cut(wf_df["Months Since Policy Inception"], bins=bins, labels=labels, right=False)
# Get overall average
overall_avg = wf_df["Number of Policies"].mean()
# Group by binned months and calculate the average number of policies
wf_df = wf_df.groupby("Policy Inception Bin", observed=False)["Number of Policies"].mean().reset_index(name="Average Number of Policies")
# Compute deviations from overall average
wf_df["Aggregate Average"] = overall_avg
wf_df["Deviation"] = wf_df["Average Number of Policies"] - wf_df["Aggregate Average"]
wf_df["Cumulative Deviation"] = wf_df["Deviation"].cumsum()
# Set an aggregate row dynamically
wf_df.loc[len(wf_df)] = ["Total", overall_avg, overall_avg, wf_df["Deviation"].sum(), wf_df.loc[len(wf_df) - 1, "Cumulative Deviation"]]
# make range for y ticks
ymin = math.floor(wf_df["Cumulative Deviation"].min() * 100) / 100
ymax = math.ceil(wf_df["Cumulative Deviation"].max() * 100) / 100
yrange = np.arange(ymin, ymax, 0.04)
# Dynamically determine the color for the total bar
if wf_df.loc[len(wf_df) - 1, "Deviation"] > 0:
end_color = "black"
elif wf_df.loc[len(wf_df) - 1, "Deviation"] < 0:
end_color = "red"
else:
end_color = "blue"
# Create the Waterfall Chart
fig = go.Figure(go.Waterfall(
name="",
orientation="v",
x=wf_df["Policy Inception Bin"],
textposition="outside",
measure=["relative"] * (len(wf_df) - 1) + ["total"], # Last row as total
y=wf_df["Deviation"],
text=['avg {:.2f} pols'.format(each) for each in wf_df["Average Number of Policies"]],
decreasing={"marker": {"color": "red"}},
increasing={"marker": {"color": "green"}},
totals={"marker": {"color": end_color}},
hovertemplate="Cumulative Deviation: " + '%{y:,.2f}' + "<br />" + "Avg # Policies for %{x} months: %{text}"
))
# edit axes info
fig.layout = go.Layout(yaxis=dict(tickvals=yrange, tickformat='.2f'))
xax = fig.update_xaxes(title_text="Months Since Policy Inception", title_font={"size": 14})
yax = fig.update_yaxes(title_text="Deviation from Average for Number of Policies", title_font={"size": 14}, dtick=0.5, zeroline=True)
updatelayout = fig.update_layout(title=dict(text="Deviation between Actual Average and Monthly Average for Number of Policies by Months Since Policy Inception<br />" +
"Green Indicates Increase, Red Indicates Decrease", font=dict(family="Arial", size=14,
color="black")),
template="simple_white",
title_x=0.5, # centered
showlegend=False,
autosize=True,
margin=dict(l=30, r=30, t=60, b=30)
)
fig.show()
The frequency heatmap below depicts the number of policies by income group. The lower frequencies show as purple. As the frequencies increase, the colors transition through the colors of the rainbow backwards, so that red shows the highest frequencies. From this heatmap, we can see that there are most accounts that only hold one policy. Additionally, we can see that there are more accounts in the income group of $0 to $9,999. Further, the pairing of $0 to $9,999 income with 1 policy has the highest frequency of all the pairs, coming in at 796 instances. This heatmap is helpful because it can show the auto insurance company who their main demographic is by income. Knowing this information helps the company to tailor their pricing, plans, and marketing strategies.
import seaborn as sns
# bin ranges for Income
income_min = int(math.floor(df["Income"].min() / 1000)) * 1000
income_max = int(math.ceil(df["Income"].max() / 1000)) * 1000
income_bins = np.arange(income_min, income_max + 10000, 10000)
income_labels = [f"${income_bins[i]:,}-${income_bins[i+1]-1:,}" for i in range(len(income_bins)-1)]
x = df[["Number of Policies", "Income"]].copy()
x["Income Bin"] = pd.cut(x["Income"], bins=income_bins, labels=income_labels, right=False)
x = x.groupby(["Number of Policies", "Income Bin"], observed=False).size().reset_index(name="Count")
heatmap_df = pd.pivot_table(x, index="Number of Policies", columns="Income Bin", values="Count", observed=False)
fig = plt.figure(figsize=(18, 14))
ax = fig.add_subplot(1, 1, 1)
comma_fmt = FuncFormatter(lambda x, p: format(int(x), ","))
ax = sns.heatmap(
heatmap_df,
linewidth=0.2,
annot=True,
cmap="rainbow",
fmt=',.0f',
square=True,
annot_kws={"size": 11},
cbar_kws={"format": comma_fmt, "orientation": "vertical"}
)
plt.title("Frequency Heatmap of Number of Policies by Income ", fontsize=18, pad=15)
plt.xlabel("Income Group", fontsize=18, labelpad=10)
plt.ylabel("Number of Policies", fontsize=18, labelpad=10)
yticks = plt.yticks(rotation=0, size=14)
xticks = plt.xticks(rotation=45, size=14, ha="right")
ax.set_xticklabels([lab.replace("$", "\\$") for lab in income_labels])
ax.invert_yaxis()
cbar = ax.collections[0].colorbar
# max_count = hm_df.to_numpy().max()
max_count = x["Count"].max()
my_colorbar_ticks = [*range(100, max_count, 100)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)
cbar.set_label("Count", rotation=270, fontsize=14, color="black", labelpad=20)
plt.show()
The visualizations included in this report give us valuable information about the auto insurance claims that were submitted to this insurance comapny that we would not have been able to discover if the data were not cleaned and processed. The scatterplot may be able to help the company adjust how they tackle their open claims, while the bar chart and multiple line plot can inform the company’s pricing, and the waterfall diagram and heatmap could influence the company’s marketing strategies.
While the visualizations created in this report have given us additional insight into the data included in the set, there can always be more data cleaning and preparation that would have resulted in visualizations and summaries that are more accurate and helpful. Further, the visualizations that were generated have allowed us to ask more valid questions that should be addressed in the future.