Auto Insurance Claims

The Auto Insurance Claims Dataset was provided by Professor Paul Tallon as a CSV file. This dataset contains information about claims made to an insurance company that provides Personal, Special, and Corporate auto coverage plans across the states of Iowa, Kansas, Missouri, Nebraska, and Oklahoma.

Data Analysis

The Auto Insurance Claims Dataset includes a total of 26 columns and 9,134 rows.

The following table contains information about each column.


Column Name Type
Coustomer string
Country string
State Code string (Categorical)
State string (Categorical)
Claim Amount float
Response string (Categorical)
Coverage string (Categorical)
Education string (Categorical)
Effective To Date Date
EmploymentStatus string (Categorical)
Gender string (Categorical)
Income float
Location Code string (Categorical)
Marital Status string (Categorical)
Monthly Premium Auto float
Months Since Last Claim int
Months Since Policy Inception int
Number of Open Complaints int
Number of Policies int
Policy Type string (Categorical)
Policy string (Categorical)
Claim Reason string (Categorical)
Sales Channel string (Categorical)
Total Claim Amount float
Vehicle Class string (Categorical)
Vehicle Size string (Categorical)


None of the rows include NA or null values.

For the columns of type int or float, summary statistics can be found:


Claim Amount

Statistic Value
Minimum 189.80
1st Q 399.42
Mean 800.49
Median 578.01
3rd Q 896.21
Maximum 8,332.53


Income

Statistic Value
Minimum 0.00
1st Q 0.00
Mean 37,657.38
Median 33,889.50
3rd Q 62,320.00
Maximum 99,981.00


Monthly Premium Auto

Statistic Value
Minimum 61.00
1st Q 68.00
Mean 93.21
Median 83.00
3rd Q 109.00
Maximum 298.00


Monthly Premium Auto

Statistic Value
Minimum 61.00
1st Q 68.00
Mean 93.21
Median 83.00
3rd Q 109.00
Maximum 298.00


Months Since Last Claim

Statistic Value
Minimum 0
1st Q 6
Mean 15.09
Median 14
3rd Q 23
Maximum 35


Months Since Policy Inception

Statistic Value
Minimum 0
1st Q 24
Mean 48.06
Median 48
3rd Q 71
Maximum 99


Number of Open Complaints

Statistic Value
Minimum 0
1st Q 0
Mean 0.38
Median 0
3rd Q 0
Maximum 5


Number of Policies

Statistic Value
Minimum 1
1st Q 1
Mean 2.97
Median 2
3rd Q 4
Maximum 9


Total Claim Amount

Statistic Value
Minimum 0.10
1st Q 272.26
Mean 434.09
Median 383.95
3rd Q 547.51
Maximum 2,893.24



From these numerical summaries, it does not appear that any of the numerical (int or float) columns included in the dataset have abnormal distributions. Some columns, such as Number of Policies and Total Claim Amount, seem to be skewed, but that should not be cause for concern.


Findings

There are many relationships to be discovered in this dataset. The following
visualizations explore insights that may help future business operations.

Open Complaints by State and Reason

The scatterplot below depicts the number of open complaints broken up by each state and the different claim reasons. We can see that Collision is the Claim Reason that has the highest number of open claims across all states, while Other has the least across all states. We also see that Iowa and Missouri have the highest number of open complaints across the different claim reasons. When looking at particular pairings, it is evident that there are most open claims in Missouri for Collision, followed by open claims in Iowa for collisions. It appears that the (Kansas, Scratch/Dent), (Kansas, Other), and (Oklahoma, Other) pairings have the lowest number of open complaints. This information could help the auto insurance company to know what states and claim reasons are most common and to adjust their plan coverage and pricing accordingly. Further, it could help the company figure out what kinds of claims and which states need the most attention.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read csv
df = pd.read_csv("./Auto_Insurance_Claims_Sample.csv")

# filter df to include only specific columns
scatter_df = df.groupby(["State", "Claim Reason"])["Number of Open Complaints"].sum().reset_index(name="sum")

# reorder claims
claim_order = ["Other", "Scratch/Dent", "Hail", "Collision"]
scatter_df['Claim Reason'] = pd.Categorical(scatter_df['Claim Reason'], categories=claim_order, ordered=True)
scatter_df = scatter_df.sort_values(['State', 'Claim Reason'])

# create scatterplot
plt.figure(figsize=(18, 8))
plt.scatter(scatter_df["State"], scatter_df["Claim Reason"], marker='8', cmap='cool',
            c=scatter_df["sum"], s=scatter_df["sum"] * 5, edgecolors="black")
plt.title("Number of Open Complaints by State and Claim Reason", fontsize=18)
plt.xlabel("State", fontsize=14)
plt.ylabel("Claim Reason", fontsize=14)
cbar = plt.colorbar() 
cbar.set_label("Number of Open Compaints", rotation=270, fontsize=14, color="black", labelpad=30)
my_colorbar_ticks = [*range(int(scatter_df["sum"].min()), int(scatter_df["sum"].max()), 100)]
cbar.set_ticks(my_colorbar_ticks)
cbar.set_ticklabels(my_colorbar_ticks)
plt.margins(x=0.1, y=0.1)
xticks = plt.xticks(scatter_df["State"])
yticks = plt.yticks(scatter_df["Claim Reason"])
plt.show()

Claim Amount by Vehicle Class

The bar chart below depicts the average claim amount by each class of vehicle. These averages are also compared to the overall average claim amount. It is evident that the “luxury” vehicle classes are both above the overall average, while the remaining vehicle classes fall below. It seems that the Luxury SUV and the Luxury Car classes have similar average claim amounts, showing that the SUV vs Car status does not matter as much as the “Luxury” title. Among the remaining vehicle classes, we see that the Sports Car group has the highest in average claim amount, followed by SUV, Two-Door Car, and Four-Door Car. This makes sense because we would expect Sports Cars to be more expensive than SUVs, which would be more expensive than Cars. Like the two luxury classes, we can see that there is not much difference in average claim amount for two-door and four-door cars. This information could be helpful to the auto insurance company so they know how to price their insurance policies for the different vehicle classes.

import matplotlib.patches as mpatches

def pick_colors_according_to_mean_count(this_data):
    """
    make a function to take df and assign color to row
    """
    colors = []
    avg = this_data["Claim Amount"].mean()
    for each in this_data["Claim Amount"]:
        if each > avg * 1.01:
            colors.append("palevioletred")
        elif each < avg * 0.99:
            colors.append("forestgreen")
        else:
            colors.append("silver")
    return colors

# filter df
bar_df = df.groupby(["Vehicle Class"]).agg({"Claim Amount": "mean"}).reset_index()
bar_df = bar_df.sort_values("Claim Amount", ascending=True)

# get colors
my_colors = pick_colors_according_to_mean_count(bar_df)

# create boundaries
above = mpatches.Patch(color="palevioletred", label="Above Average")
at = mpatches.Patch(color="silver", label="Within 1% of the Average")
below = mpatches.Patch(color="forestgreen", label="Below Average")

# set up plot
fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(bar_df["Vehicle Class"], bar_df["Claim Amount"], color=my_colors)  # horizontal bars is barh

# add labels to the end of each bar
for row_counter, value_at_row_counter in enumerate(bar_df["Claim Amount"]):
    if value_at_row_counter > bar_df["Claim Amount"].mean() * 1.01:
        color = "palevioletred"
    elif value_at_row_counter < bar_df["Claim Amount"].mean() * 0.99:
        color = "forestgreen"
    else:
        color = "silver"
    ax1.text(value_at_row_counter + 10, row_counter, str(f"{value_at_row_counter:,.2f}"), color=color,
             size=12, fontweight="bold", ha="left", va="center", backgroundcolor="white")
xlim = plt.xlim(0, bar_df["Claim Amount"].max() * 1.1)

ax1.legend(loc="lower right", handles=[above, at, below], fontsize=14)
plt.axvline(bar_df["Claim Amount"].mean(), color="black", linestyle="dashed")  # vertical reference line
ax1.text(bar_df["Claim Amount"].mean() + 2, 0, str(f"Mean = {bar_df['Claim Amount'].mean():,.2f}"), rotation=0, fontsize=14)

ax1.set_title("Average Claim Amount by Vehicle Class", size=20)
ax1.set_xlabel("Claim Amount", fontsize=16)
ax1.set_ylabel("Vehicle Class", fontsize=16)
ax1.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{x:,.0f}"))
xticks = plt.xticks(fontsize=14)
yticks = plt.yticks(fontsize=14)
plt.show()

Number of Policies vs Months Since Policy Inception

The waterfall diagram below depicts the deviation between the actual average number of policies and the average number of policies by months since policy inception bins. We see that there is a larger increase in policies from the 30-39 months since inception group to the 40-49 months since inception group. Further, there are decreases in the average number of policies from the 40-49 months since inception group to the 50-59 months since inception group, as well as the 80-89 months since inception group to the 90-99 months since inception group. It would be important to analyze the information that this diagram gives to see what may be the cause for these changes, as well as to see if the differences are statistically significant or not. If these results were statistically significant, this information could possibly inform the marketing team to target those in 40-49 range to see if they can sell more policies to those who may have only 1 current policy.

import os
import math
import plotly.graph_objects as go
import plotly.io as pio

# Sort values and reset index
wf_df = df[["Number of Policies", "Months Since Policy Inception"]].sort_values(by="Months Since Policy Inception")
wf_df.reset_index(inplace=True, drop=True)

# Define bins for Policy Inception grouping
bins = np.arange(0, 101, 10)  # Bins: [0, 10, ..., 100]
labels = [f"{i}-{i+9}" for i in bins[:-1]]  # Labels: "0-9", "10-19", ...

# Create a new column for binned groups
wf_df["Policy Inception Bin"] = pd.cut(wf_df["Months Since Policy Inception"], bins=bins, labels=labels, right=False)

# Get overall average
overall_avg = wf_df["Number of Policies"].mean()

# Group by binned months and calculate the average number of policies
wf_df = wf_df.groupby("Policy Inception Bin", observed=False)["Number of Policies"].mean().reset_index(name="Average Number of Policies")

# Compute deviations from overall average
wf_df["Aggregate Average"] = overall_avg
wf_df["Deviation"] = wf_df["Average Number of Policies"] - wf_df["Aggregate Average"]

wf_df["Cumulative Deviation"] = wf_df["Deviation"].cumsum()

# Set an aggregate row dynamically
wf_df.loc[len(wf_df)] = ["Total", overall_avg, overall_avg, wf_df["Deviation"].sum(), wf_df.loc[len(wf_df) - 1, "Cumulative Deviation"]]

# make range for y ticks
ymin = math.floor(wf_df["Cumulative Deviation"].min() * 100) / 100
ymax = math.ceil(wf_df["Cumulative Deviation"].max() * 100) / 100
yrange = np.arange(ymin, ymax, 0.04)

# Dynamically determine the color for the total bar
if wf_df.loc[len(wf_df) - 1, "Deviation"] > 0:
    end_color = "black"
elif wf_df.loc[len(wf_df) - 1, "Deviation"] < 0:
    end_color = "red"
else:
    end_color = "blue"

# Create the Waterfall Chart
fig = go.Figure(go.Waterfall(
    name="",
    orientation="v",
    x=wf_df["Policy Inception Bin"],
    textposition="outside",
    measure=["relative"] * (len(wf_df) - 1) + ["total"],  # Last row as total
    y=wf_df["Deviation"], 
    text=['avg {:.2f} pols'.format(each) for each in wf_df["Average Number of Policies"]],
    decreasing={"marker": {"color": "red"}},
    increasing={"marker": {"color": "green"}},
    totals={"marker": {"color": end_color}},
    hovertemplate="Cumulative Deviation: " + '%{y:,.2f}' + "<br />" + "Avg # Policies for %{x} months: %{text}"
))

# edit axes info
fig.layout = go.Layout(yaxis=dict(tickvals=yrange, tickformat='.2f'))
xax = fig.update_xaxes(title_text="Months Since Policy Inception", title_font={"size": 14})
yax = fig.update_yaxes(title_text="Deviation from Average for Number of Policies", title_font={"size": 14}, dtick=0.5, zeroline=True)

updatelayout = fig.update_layout(title=dict(text="Deviation between Actual Average and Monthly Average for Number of Policies by Months Since Policy Inception<br />" +
                             "Green Indicates Increase, Red Indicates Decrease", font=dict(family="Arial", size=14,
                                                                                            color="black")),
                  template="simple_white",
                  title_x=0.5,  # centered
                  showlegend=False,
                  autosize=True,
                  margin=dict(l=30, r=30, t=60, b=30)
                  )


fig.show()

Monthly Premium Auto vs Months Since Last Claim

The multiple line plot below depicts the count of claims submitted by the number of months since the customer’s last claim for each monthly premium for auto insurance groups. It is evident that the $50-$99 Monthly Premium Auto group has the highest number of overall claims across all Months Since Last Claim groups; however, the count does decrease as the Months Since Last Claim increases. The $100-$149 Monthly Premium Auto group follows a similar trend, though with less accounts. The remaining Monthly Premium Auto groups all have low counts across all Months Since Last Claim bins. While it seems that the $50-$99 and the $100-$149 Monthly Premium Auto groups have the highest instances of claims submitted, we must also remember that these groups may have the highest number of customer accounts. It may be important to recreate this chart using percentages instead of count for further analysis.

from matplotlib.ticker import FuncFormatter

# bin ranges for Monthly Premium Auto
premium_min = int(math.floor(df["Monthly Premium Auto"].min() / 50)) * 50
premium_max = int(math.ceil(df["Monthly Premium Auto"].max() / 50)) * 50
premium_bins = np.arange(premium_min, premium_max + 50, 50)
premium_labels = [f"${premium_bins[i]}-${premium_bins[i+1]-1}" for i in range(len(premium_bins)-1)]

# bin ranges for Months Since Last Claim
claim_min = int(math.floor(df["Months Since Last Claim"].min() / 5)) * 5
claim_max = int(math.ceil(df["Months Since Last Claim"].max() / 5)) * 5
claim_bins = np.arange(claim_min, claim_max + 5, 5)
claim_labels = [f"{claim_bins[i]}-{claim_bins[i+1]-1}" for i in range(len(claim_bins)-1)]

# Copy new df
lines_df = df[["Monthly Premium Auto", "Months Since Last Claim"]].copy()

# bin columns
lines_df["Auto Premium Bin"] = pd.cut(lines_df["Monthly Premium Auto"], bins=premium_bins, labels=premium_labels, right=False)
lines_df["Months Since Last Claim Bin"] = pd.cut(lines_df["Months Since Last Claim"], bins=claim_bins, labels=claim_labels, right=False)

# rm unnecessary columns
del lines_df["Monthly Premium Auto"]
del lines_df["Months Since Last Claim"]

# convert bins to categorical with sorted order
lines_df["Auto Premium Bin"] = pd.Categorical(lines_df["Auto Premium Bin"], categories=premium_labels, ordered=True)
lines_df["Months Since Last Claim Bin"] = pd.Categorical(lines_df["Months Since Last Claim Bin"], categories=claim_labels, ordered=True)

# Group by binned categories and count occurrences
lines_df = lines_df.groupby(["Auto Premium Bin", "Months Since Last Claim Bin"], observed=False).size().reset_index(name="Count")

# Sort the dataframe
lines_df = lines_df.sort_values(by=["Auto Premium Bin", "Months Since Last Claim Bin"])

# set colors
number_outside_colors = len(lines_df["Auto Premium Bin"].unique())
outside_color_ref_number = np.arange(number_outside_colors)
colormap = plt.get_cmap("Set2")
outer_colors = colormap(outside_color_ref_number)
my_colors = {lab: outer_colors[i] for i, lab in enumerate(lines_df["Auto Premium Bin"].unique())}

fig, ax = plt.subplots(figsize=(10,6))

# make lines
for key, grp in lines_df.groupby(["Auto Premium Bin"], observed=False):
    grp.plot(ax=ax, kind="line", x="Months Since Last Claim Bin", y="Count", 
             color=my_colors[key[0]], label=key, marker='8')

# formatting
plt.title("Monthly Premium Auto by Months Since Last Claim", fontsize=18)
ax.set_xlabel("Months Since Last Claim", fontsize=18)
ax.set_ylabel("Count", fontsize=18, labelpad=20)

# make the y axis larger to fit the legend
y_min, y_max = ax.get_ylim()
ylim = ax.set_ylim(y_min, y_max * 1.2)

# add commas
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f"{x:,.0f}"))

# reformat labels
handles, labels = ax.get_legend_handles_labels()
labels = [lab.replace("$", "\\$") for lab in premium_labels]
plt.legend(handles, labels, title="Monthly Premium Auto", loc="best", fontsize=14, ncol=1)

plt.show()

Number of Policies vs Binned Income

The frequency heatmap below depicts the number of policies by income group. The lower frequencies show as purple. As the frequencies increase, the colors transition through the colors of the rainbow backwards, so that red shows the highest frequencies. From this heatmap, we can see that there are most accounts that only hold one policy. Additionally, we can see that there are more accounts in the income group of $0 to $9,999. Further, the pairing of $0 to $9,999 income with 1 policy has the highest frequency of all the pairs, coming in at 796 instances. This heatmap is helpful because it can show the auto insurance company who their main demographic is by income. Knowing this information helps the company to tailor their pricing, plans, and marketing strategies.

import seaborn as sns

# bin ranges for Income
income_min = int(math.floor(df["Income"].min() / 1000)) * 1000
income_max = int(math.ceil(df["Income"].max() / 1000)) * 1000
income_bins = np.arange(income_min, income_max + 10000, 10000)
income_labels = [f"${income_bins[i]:,}-${income_bins[i+1]-1:,}" for i in range(len(income_bins)-1)]

x = df[["Number of Policies", "Income"]].copy()
x["Income Bin"] = pd.cut(x["Income"], bins=income_bins, labels=income_labels, right=False)
x = x.groupby(["Number of Policies", "Income Bin"], observed=False).size().reset_index(name="Count")

heatmap_df = pd.pivot_table(x, index="Number of Policies", columns="Income Bin", values="Count", observed=False)

fig = plt.figure(figsize=(18, 14))
ax = fig.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ","))
ax = sns.heatmap(
    heatmap_df,
    linewidth=0.2,
    annot=True,
    cmap="rainbow",
    fmt=',.0f',
    square=True,
    annot_kws={"size": 11},
    cbar_kws={"format": comma_fmt, "orientation": "vertical"}
)

plt.title("Frequency Heatmap of Number of Policies by Income ", fontsize=18, pad=15)
plt.xlabel("Income Group", fontsize=18, labelpad=10)
plt.ylabel("Number of Policies", fontsize=18, labelpad=10)
yticks = plt.yticks(rotation=0, size=14)
xticks = plt.xticks(rotation=45, size=14, ha="right")

ax.set_xticklabels([lab.replace("$", "\\$") for lab in income_labels])

ax.invert_yaxis()

cbar = ax.collections[0].colorbar

# max_count = hm_df.to_numpy().max()
max_count = x["Count"].max()

my_colorbar_ticks = [*range(100, max_count, 100)]
cbar.set_ticks(my_colorbar_ticks)

my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

cbar.set_label("Count", rotation=270, fontsize=14, color="black", labelpad=20)

plt.show()

Conclusion

The visualizations included in this report give us valuable information about the auto insurance claims that were submitted to this insurance comapny that we would not have been able to discover if the data were not cleaned and processed. The scatterplot may be able to help the company adjust how they tackle their open claims, while the bar chart and multiple line plot can inform the company’s pricing, and the waterfall diagram and heatmap could influence the company’s marketing strategies.

While the visualizations created in this report have given us additional insight into the data included in the set, there can always be more data cleaning and preparation that would have resulted in visualizations and summaries that are more accurate and helpful. Further, the visualizations that were generated have allowed us to ask more valid questions that should be addressed in the future.