Auto Insurance Analysis

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH']='C:/ProgramData/Anaconda3/Library/plugins/platforms'

Introduction

In the highly competitive auto insurance industry, maintaining profitability requires a granular understanding of where and why financial liabilities occur. Standard metrics could fail to capture the nuanced realities of claim severities hidden within the data. This report applies data visualization techniques to a sample auto insurance database to elucidate the underlying risk profiles of the customer base. By identifying high-risk socio-economic pockets, tracking cumulative budget deviations, and more, this analysis provides a clearer path for insurance managers to implement effective pricing models and mitigate financial exposure.

Dataset

The provided dataset contains comprehensive historical data regarding auto insurance policies and their associated claims. Key variables analyzed in this report include Total Claim Amount (representing the financial severity of a filed claim), Income and Employment Status (demographic indicators), Monthly Premium Auto (the current pricing structure), and Vehicle Class (the asset type insured). Time series components, such as Effective To Date, are utilized to track the volatility of claims across sequential weeks.

Findings

The following visualizations deconstruct the financial liabilities of the insurance portfolio. Navigate through the tabs below to explore the risks identified within the data.

Scatterplot

The scatterplot below highlights a critical “danger zone” in the portfolio. By mapping income levels against monthly premiums, a distinct pocket of extremely high-severity claims (highlighted by the dashed triangle) emerges among lower-income drivers paying premium rates.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
import plotly.io as pio

filename = "U:/Auto_Insurance_Claims.csv"
df = pd.read_csv(filename, usecols=['Income', 'Monthly Premium Auto', 'Total Claim Amount', 'EmploymentStatus', 'Vehicle Class', 'Policy Type', 'Coverage', 'Effective To Date', 'Location Code'])

# Grouping Income into $10k brackets
df['Income_10k'] = (df['Income'] // 10000) * 10

# Grouping Monthly Premium into $20 brackets
df['Premium_20s'] = (df['Monthly Premium Auto'] // 20) * 20

# Aggregating the data and set marker size
agg_df = df.groupby(['Income_10k', 'Premium_20s'])['Total Claim Amount'].mean().reset_index()
agg_df['Size_Scaled'] = agg_df['Total Claim Amount'] * 0.6 

# Plotting the scatterplot
plt.figure(figsize=(18, 10))
plt.scatter(agg_df['Income_10k'], agg_df['Premium_20s'], 
            marker='s', cmap='plasma', 
            c=agg_df['Total Claim Amount'], 
            vmin=200, vmax=1200,
            s=agg_df['Size_Scaled'], alpha=0.8, edgecolors='black')

# Adding Titles and Labels
plt.title('The Socio-Economic Risk Grid: Claim Amount by Income and Premium Level', fontsize=18)
plt.xlabel('Customer Income Level (in Thousands $)', fontsize=14, labelpad=15)
plt.ylabel('Monthly Auto Premium ($)', fontsize=14, labelpad=15)

# Adding the Colorbar
cbar = plt.colorbar()
cbar.set_label('Average Claim Amount ($)', rotation=270, fontsize=14, color='black', labelpad=30)
my_colorbar_ticks = cbar.get_ticks()
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = [ '{:,}'.format(int(each)) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

# Altering the axes for a clean grid
plt.xticks(np.arange(0, 110, 10), fontsize=12)

plt.yticks(np.arange(60, 320, 20), fontsize=12)

plt.grid(True, linestyle='--', alpha=0.3)

# Highlighting high risk factors
triangle_x = [-5, -5, 95, -5] 
triangle_y = [80, 300, 300, 80]
plt.plot(triangle_x, triangle_y, color='black', linestyle='--', linewidth=3)
plt.text(3, 292, 'High Risk Exposure', color='black', fontsize=12, fontweight='bold')

# Finalizing
plt.xlim(right=95)

plt.show()

Dual-axis Bar Chart

While employed individuals generate the vast majority of total policy volume, this dual-axis chart reveals that unemployed drivers carry the highest average claim severity. This disparity suggests a potential misalignment in current demographic risk assessments.

# Aggregating the data
d2 = df.groupby('EmploymentStatus')['Total Claim Amount'].agg(['count', 'mean']).reset_index()
d2.columns = ['EmploymentStatus', 'Count', 'AvgClaim']

# Sorting by count
d2 = d2.sort_values('Count', ascending=False).reset_index(drop=True)

# Building the Dual-Axis Chart
fig = plt.figure(figsize=(18, 10))
ax1 = fig.add_subplot(1,1,1)
ax2 = ax1.twinx()
bar_width = 0.4

# Creating the x-axis positions for the bars
x_pos = np.arange(len(d2))

# Plotting the two sets of bars side-by-side
count_bars = ax1.bar(x_pos - (0.5 * bar_width), d2['Count'], bar_width, 
                     color='orange', edgecolor='black', label='Policy Count')

avg_bars = ax2.bar(x_pos + (0.5 * bar_width), d2['AvgClaim'], bar_width, 
                   color='darkblue', edgecolor='black', label='Average Claim ($)')

# Formatting the axes
ax1.set_xlabel('Employment Status', fontsize=18, labelpad=15)
ax1.set_ylabel('Total Number of Policies', fontsize=18, labelpad=20)
ax2.set_ylabel('Average Claim Amount ($)', fontsize=18, rotation=270, labelpad=30)

ax1.set_xticks(x_pos)
ax1.set_xticklabels(d2['EmploymentStatus'], fontsize=14)

my_ax1_ticks = ax1.get_yticks()
ax1.set_yticks(my_ax1_ticks)
ax1.set_yticklabels([ '{:,}'.format(int(each)) for each in my_ax1_ticks])

my_ax2_ticks = ax2.get_yticks()
ax2.set_yticks(my_ax2_ticks)
ax2.set_yticklabels([ '{:,}'.format(int(each)) for each in my_ax2_ticks])

ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)

plt.title('Demographic Risk: Policy Volume vs. Average Claim Amount', fontsize=20)

# Setting the Legend and the Grid
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper right', fontsize=14)

# Adding labels to bars
def autolabel(sam_bars, sam_ax, place_of_decimals, symbol):
    for each_bar in sam_bars:
        height = each_bar.get_height()
        sam_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimals),
                    fontsize=12, color='black', ha='center', va='bottom', fontweight='bold')

ax1.set_ylim(0, d2['Count'].max() * 1.10)

ax2.set_ylim(0, d2['AvgClaim'].max() * 1.10)

autolabel(count_bars, ax1, ',.0f', '')
autolabel(avg_bars, ax2, ',.0f', '$')

plt.grid(True, linestyle='--', alpha=0.3)
plt.show()

Heatmap

Standard intuition might suggest that urban claims are the most expensive due to traffic density. However, this heatmap reveals a glaring financial vulnerability in the suburbs. By crossing location types with vehicle classes, a distinct “red zone” of extreme claim severity emerges for Suburban Luxury vehicles.

# Pivot the data to create the grid (Rows: Location, Columns: Vehicle)
hm_df = df.groupby(['Location Code', 'Vehicle Class'])['Total Claim Amount'].mean().unstack()

# Reorder the rows to flow logically by population density
if 'Rural' in hm_df.index:
    hm_df = hm_df.reindex(['Rural', 'Suburban', 'Urban'])

# Create a custom nested array to format the numeric values inside the cells
annot_array = np.array([[f"${val:,.0f}" if pd.notna(val) else "" for val in row] for row in hm_df.values])

# Set up the figure space
plt.figure(figsize=(16, 8))

# Build the Heatmap
ax = sns.heatmap(hm_df, annot=annot_array, fmt='', cmap='Reds', 
                 linewidths=1.5, linecolor='lightgray', 
                 annot_kws={"size": 13, "weight": "bold"},
                 cbar_kws={'label': 'Average Claim Amount ($)'})

# Format Titles and Axis Labels
plt.title('Geographic Risk Profile: Average Claim Severity by Location & Vehicle', fontsize=18, pad=20)
plt.ylabel('Location Type', fontsize=15, labelpad=10)
plt.xlabel('Vehicle Class', fontsize=15, labelpad=10)

# Update tick marks for maximum readability
plt.xticks(rotation=45, ha='right', fontsize=13)

plt.yticks(rotation=0, fontsize=13)

# Format the Colorbar legend on the side
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
cbar.set_label('Average Claim Amount ($)', rotation=270, fontsize=14, labelpad=25)
my_colorbar_ticks = cbar.get_ticks()
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = [ '{:,}'.format(int(each)) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

plt.tight_layout()
plt.show()

Bump Chart

When evaluated on a monthly basis, claim severities appear relatively stable. However, when the timeline is compressed, significant volatility is revealed. This bump chart tracks the dynamic week-to-week shifts in liability rankings among vehicle classes.

# Calculate sequential week
df['Effective To Date'] = pd.to_datetime(df['Effective To Date'], format='%m/%d/%y')
start_date = df['Effective To Date'].min()
df['Week'] = ((df['Effective To Date'] - start_date).dt.days // 7) + 1

# Set up df for Bump Chart
bump_df = df.groupby(['Vehicle Class', 'Week'])['Total Claim Amount'].mean().reset_index(name='AvgClaim')
bump_df = bump_df.pivot(index='Vehicle Class', columns='Week', values='AvgClaim')
bump_df_ranked = bump_df.rank(0, ascending=False, method='min').T

# Plot Bump Chart
fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=2, linewidth=6,
                    markersize=44, markerfacecolor='white')

ax.invert_yaxis()
num_cols = bump_df_ranked.shape[1]

# Format labels and titles
plt.ylabel('Weekly Ranking', fontsize=18, labelpad=10)
plt.title('Weekly Ranking of Average Claim Severity by Vehicle Class', fontsize=22, pad=15)

plt.yticks(range(1, num_cols + 1, 1), fontsize=16)

plt.xticks(bump_df_ranked.index, fontsize=16)

ax.set_xlabel('Sequential Week (Jan 1 - Feb 28)', fontsize=18)

# Format the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, title='Vehicle Class', bbox_to_anchor=(1.01, 1.01), fontsize=14,
          labelspacing=1, markerscale=0.4, borderpad=1, handletextpad=0.8)

# Add internal marker labels
i = 0
j = 0
for col in bump_df_ranked.columns:
    for row in bump_df_ranked.index:
        rank = bump_df_ranked.iloc[i,j]
        val = bump_df.iloc[j,i]
        text_str = f"${val:,.0f}"
        ax.text(row, rank, text_str, ha='center', va='center', fontsize=11, fontweight='bold')
        i += 1
    j += 1
    i = 0

plt.tight_layout()
plt.show()

Note: Data restricted to January & February

Nested Donut Chart

The nested donut chart shows how different tiers of insurance products contribute to the overall average claim size. The uniformity in financial severity across these varying policy types and coverage levels represents an insight in itself. This parity indicates that liabilities are distributed evenly throughout the product portfolio, suggesting that a claim’s ultimate cost is driven more by external risk factors than by the specific insurance tier a customer selects.

# Build the dataframe
pie_df = df.groupby(['Policy Type', 'Coverage'])['Total Claim Amount'].mean().reset_index(name='AvgClaim')

# Sort alphabetically to align the colormap sequence
pie_df.sort_values(by=['Policy Type', 'Coverage'], inplace=True)
pie_df.reset_index(inplace=True, drop=True)

# Define color reference numbers
outside_color_ref_number = [0, 4, 12]
all_color_ref_number = []
for base in outside_color_ref_number:
    all_color_ref_number.extend([base, base+1, base+2, base+3])

inside_color_ref_number = [each for each in all_color_ref_number if each not in outside_color_ref_number]

# Initialize the figure
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)
inner_colors = colormap(inside_color_ref_number)

overall_avg = df['Total Claim Amount'].mean()

# Plot the outer pie chart
pie_df.groupby(['Policy Type'], sort=False)['AvgClaim'].sum().plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance=0.85,
    labeldistance=1.1, wedgeprops=dict(edgecolor='w'), textprops={'fontsize': 14},
    autopct='%1.1f%%', 
    startangle=90
)

# Custom inner label formatting
def make_inner_label(pct, allvals, labels):
    absolute = int(np.round(pct/100.*np.sum(allvals)))
    current_label = labels.pop(0) 
    return f"{current_label}\n${absolute:,.0f}"

coverage_labels = pie_df['Coverage'].tolist()

# Plot the inner pie chart
pie_df['AvgClaim'].plot(
    kind='pie', radius=0.7, colors=inner_colors, pctdistance=0.75, 
    labels=None, wedgeprops=dict(edgecolor='w'), textprops={'fontsize': 11},
    autopct=lambda pct: make_inner_label(pct, pie_df['AvgClaim'], coverage_labels),
    startangle=90
)

# Center hole
hole = plt.Circle((0, 0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Severity by Product Mix: Average Claim Amount', fontsize=18, pad=20)

center_text = f"Average\nClaim Size\n${overall_avg:,.0f}"
ax.text(0, 0, center_text, size=16, ha='center', va='center', fontweight='bold')

plt.text(0, -1.2, "*Outer percentages represent the share of total average claim severity by Policy Type", 
         ha='center', va='center', fontsize=12, style='italic')

ax.axis('equal')

plt.tight_layout()
plt.show()

Conclusion

The visualizations presented in this report confirm that financial liability within the auto insurance portfolio is not evenly distributed. By analyzing demographic variables, it becomes apparent that unemployed drivers generate disproportionately expensive claims compared to their employed counterparts. Furthermore, a deeper look into socio-economic indicators exposes a severe risk pocket among lower-income customers paying premium rates.

Beyond demographic profiling, geographic and asset-based analyses provide actionable insights for risk mitigation. The geographic heatmap exposes suburban luxury vehicles as an extreme financial liability. Tracking temporal volatility through the bump chart demonstrates that vehicle-class risks fluctuate significantly on a week-to-week basis. Additionally, understanding the product mix reveals the exact severity burdens carried by distinct policy types and coverage tiers.

Management can leverage these localized insights to refine current risk models. Implementing targeted rate adjustments for high-risk suburban policies or reevaluating the pricing structure for unemployed segments will ensure the long-term financial resilience of the portfolio.