Analysis of Auto Insurance Claims

Introduction

The visuals presented in this analysis are based on auto insurance claim data from January to February of 2011. This analysis will address the details of different auto insurance claims as well as potential reasons for the denial of these requests. Do certain claim reasons get rejected more than others? Is there a relationship between coverage types and claim dollar amounts? Is the distribution of states varied across each claim reason? Is there a relationship between coverage type and claim reason? This analysis will also look at trends relating to demographics. Does a person’s education level or gender indicate their likelihood of filing an auto insurance claim? These questions will be answered through the visualizations illustrated below.

Dataset

The data set consists of 9,134 observations and 26 variables. As described above, it covers auto insurance claims filed between January and February of 2011. It includes data from only five Midwestern states: Iowa, Missouri, Kansas, Nebraska, and Oklahoma. The key variables investigated in this analysis are education, state, claim reason, coverage, claim amount, gender, and response. These variables will be used to determine whether any relationships exist within the data.

Findings

The following tabs include visualizations to help answer the above questions regarding auto insurance claims. These visualizations look at the count of auto insurance claims in relation to claim reason, response type, coverage type, education level, state, and gender. They also look at the average dollar amount of auto insurance claims in relation to claim reason and how these amounts correspond to the responses regarding whether coverage was received.

Tab 1

The line plot below shows the number of claims made for each claim reason. Its two lines distinguish claims that were approved from those that were not.

From the line plot, it is evident that no matter the claim reason, there are more “No” responses than “Yes” responses when it comes to receiving coverage. The counts range from a high of 3,158 to a low of 30. There are no “Yes” responses for the claim reason of other. Collision and hail are the top two claim reasons in the data set, with scratch/dent and other following behind.

setwd("U:/")

file1 <- "R_datafiles/Auto_Insurance_Claims.csv"

library(data.table)
library(plyr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(dplyr)
library(lubridate)
library(DescTools)
library(ggrepel)
library(plotly)

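# Read the claims data (fread returns a data.table)
df <- fread(file1)

# Count claims for each Response / Claim Reason pair, largest first
df1 <- count(df, Response, `Claim Reason`)
df1 <- df1[order(df1$n, decreasing = TRUE),]

# Rows holding the smallest and largest counts, used to highlight points;
# data.frame() renames `Claim Reason` to Claim.Reason
hi_lo <- df1 %>%
  filter(n==min(n) | n==max(n)) %>%
  data.frame()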

p1 <- ggplot(df1, aes(x=`Claim Reason`, y=n, group=Response))+
  geom_line(aes(color=Response), linewidth=3)+
  labs(title="Auto Insurance Claim Count by Reason and Response Type", x="Claim Reason", y="Claim Count")+
  theme_light()+
  theme(plot.title=element_text(hjust=0.5))+
  geom_point(shape=21, size=5, color="black", fill="white")+
  scale_y_continuous(labels=comma)+
  scale_color_brewer(palette="Accent", name="Response", guide=guide_legend(reverse=TRUE))+
  geom_point(data=hi_lo, aes(x=Claim.Reason, y=n), shape=21, size=4, fill="pink", color="pink")+
  geom_label_repel(aes(label=ifelse(n==min(n) | n==max(n), scales::comma(n),"")),
                   box.padding=1,
                   point.padding=1,
                   size=4,
                   color="grey7",
                   segment.color="darkblue")
p1
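
Because the line plot compares raw counts, a claim reason with many claims will also tend to have many “No” responses. One way to compare reasons on equal footing is to compute the share of “No” responses within each claim reason. The following is a minimal sketch of that idea in base R. The helper name no_share_by_reason and the toy counts are assumptions made for illustration; they are not part of the original analysis and are not values from the claims data.

```r
# Toy counts (placeholders only, NOT values from the claims data)
toy_counts <- data.frame(
  Response     = c("No", "Yes", "No", "Yes", "No"),
  Claim.Reason = c("Collision", "Collision", "Hail", "Hail", "Other"),
  n            = c(80, 20, 45, 5, 10)
)

# Hypothetical helper: percentage of "No" responses within each claim reason.
# xtabs() builds a reason-by-response table and fills missing pairs with 0.
no_share_by_reason <- function(counts) {
  tab <- xtabs(n ~ Claim.Reason + Response, data = counts)
  round(100 * tab[, "No"] / rowSums(tab), 1)
}

no_share_by_reason(toy_counts)
```

Applied to a count table shaped like df1, the same calculation would show whether any reason stands out once the total number of claims per reason is taken into account. The column name would need to be back-quoted or renamed, since the original column contains a space.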

Tab 2

The next plot reveals the number of claims made for each coverage type. Each stacked bar segment represents a different claim reason.

Basic coverage has the largest total count of claims with a count of 5,568, which is then followed by extended coverage (2,742) and premium coverage (824). It makes sense that basic coverage possesses the highest claim count because it is the cheapest coverage option. A cheaper coverage option will likely appeal to a larger number of people. Thus, a bigger population leads to a higher number of potential claims to be filed.

It is evident that though the three different types of coverage differ in claim counts, they all follow a similar pattern regarding claim reason. The largest number of claims filed within each coverage type are due to collisions. The next claim reasons in order from largest to smallest for all coverage types include hail, scratch/dent, and other.

df2 <- count(df,`Claim Reason`, Coverage)
df2 <- df2[order(df2$n, decreasing=TRUE),]

coverage_type <- df2 %>%
  group_by(Coverage) %>%
  summarise(TotalResponse=sum(n)) %>%
  data.frame()

maximum_y <- round_any(max(coverage_type$TotalResponse), 1000, ceiling)

ggplot(df2, aes(x=reorder(Coverage,n,sum), y=n, fill=`Claim Reason`))+
  geom_bar(stat="identity", position=position_stack(reverse=TRUE))+
  coord_flip()+
  labs(title="Auto Insurance Claim Count by Coverage Type and Reason", x="Coverage Type", y="Claim Count", fill="Claim Reason")+
  theme_light()+
  theme(plot.title=element_text(hjust=0.5))+
  scale_fill_brewer(palette="Pastel1")+
  geom_text(data=coverage_type, aes(x=Coverage, y=TotalResponse, label=scales::comma(TotalResponse), fill=NULL), hjust=-0.1, size=4)+
  scale_y_continuous(labels=comma,
                     breaks=seq(0,maximum_y,by=1000),
                     limits=c(0,maximum_y))
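
The coverage totals quoted above can also be read as shares of all claims. The short sketch below uses only the three totals reported in the text; the variable names are chosen here for illustration.

```r
# Claim totals per coverage type, as reported in the text above
coverage_totals <- c(Basic = 5568, Extended = 2742, Premium = 824)

# Total number of claims across the three coverage types
total_claims <- sum(coverage_totals)

# Percentage of all claims that falls under each coverage type
coverage_share_pct <- round(100 * coverage_totals / total_claims, 1)

total_claims
coverage_share_pct
```

The three totals add up to 9,134, which matches the number of observations in the data set, so every claim falls under exactly one coverage type.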

Tab 3

The following visual shows the average claim amount associated with each claim reason. As in the previous visual, the bars are stacked, but here each segment represents a different coverage type. The line overlaid on the bars indicates the number of “No” responses associated with each claim reason.

No matter the claim reason, those with premium coverage generate the largest average claim dollar amounts: $1,162, $1,031, $1,011, and $989. Premium coverage is followed by extended coverage ($954, $890, $809, and $788) and then basic coverage ($772, $725, $674, and $658). One possible explanation is that those with premium coverage are prone to more accidents, typically accidents resulting in greater monetary damages, and that this susceptibility is why they paid extra for premium coverage in the first place. If so, premium coverage holders would be more likely to file larger claim amounts when they file claims.

Similar to the previous visualization, though the four different types of claim reasons differ in claim amounts, they all follow a similar pattern regarding coverage type. As previously mentioned, this pattern indicates that those with premium coverage generate the largest claim dollar amounts no matter the claim reason, followed by those with extended and basic coverage.

Additionally, the claim reason that generates the largest claim amounts is collision, followed by scratch/dent, hail, and other. It is interesting to note that collisions generate not only the highest claim amounts but also the largest number of “No” responses when it comes to receiving coverage. Similarly, other generates both the lowest claim amounts and the fewest “No” responses.

df3 <- df %>%
  select(`Claim Amount`, `Claim Reason`, Coverage) %>%
  group_by(`Claim Reason`, Coverage) %>%
  summarise(AvgClaimAmt=mean(`Claim Amount`), .groups='keep') %>%
  group_by(`Claim Reason`) %>%
  mutate(label_position = cumsum(AvgClaimAmt) - (AvgClaimAmt / 2)) %>%
  ungroup() %>%
  data.frame()

df3 <- df3[order(df3$AvgClaimAmt, decreasing=TRUE),]

df4 <- count(df, Response, `Claim Reason`)
df4 <- df4[order(df4$n, decreasing = TRUE),]

# Keep only the "No" responses for the overlaid line
# (a logical filter stays safe even if no "Yes" rows exist)
df4 <- df4[df4$Response != "Yes",]

# Secondary-axis breaks in thousands of claims (0, 1, 2, ...)
ylab <- seq(0,max(df4$n)/1000,1)

my_labels <- comma(ylab*1000)

ggplot(df3, aes(x=reorder(Claim.Reason,AvgClaimAmt,sum), y=AvgClaimAmt, fill=Coverage))+
  geom_bar(stat="identity", position=position_stack(reverse=TRUE))+
  coord_flip()+
  theme_light()+
  labs(title="Average Auto Insurance Claim Amounts by Claim Reason", x="Claim Reason", y="Average Claim Amount", fill="Coverage Type")+
  theme(plot.title=element_text(hjust=0.5))+
  scale_fill_brewer(palette="Accent")+
  geom_line(inherit.aes=FALSE,data=df4,
            aes(x=`Claim Reason`, y=n*2.5, colour="Total 'No' Responses", group=1), linewidth=1)+
  scale_color_manual(NULL,values="black")+
  scale_y_continuous(labels=dollar,
                     sec.axis=sec_axis(~./2.5, name="Total 'No' Responses", labels=my_labels, breaks=ylab*1000))+
  geom_point(inherit.aes=FALSE, data=df4, 
             aes(x=`Claim Reason`, y=n*2.5, group=1),
             size=3, shape=21, fill="white", color="black")+
  geom_text(aes(y = label_position, label = dollar(round(AvgClaimAmt,0))),
            size = 3, color = "white") +  
  theme(legend.background=element_rect(fill="transparent"),
        legend.box.background=element_rect(fill="transparent", colour=NA),
        legend.spacing=unit(-1,"lines"))

Tab 4

The heat map below displays the number of claims filed by education level and claim reason.

From the heat map, one can see that most auto insurance claims are filed by those with bachelor’s degrees, followed by those in college, and then those in high school or below. The two lowest counts come from those with master’s degrees, followed by those with doctorate degrees. There is logic behind these numbers because there is a larger population of people with bachelor’s degrees than doctorate degrees. A bigger population means more people are available to file auto insurance claims.

In addition, as shown with the previous charts, most claims are filed due to collisions, followed by hail, scratch/dent, and other.

df5 <- count(df, `Claim Reason`, Education)
df5 <- df5[order(df5$n, decreasing=TRUE),]

mylevels <- c('Scratch/Dent', 'Other', 'Hail', 'Collision')
mylevels2 <- c('High School or Below', 'College', 'Bachelor', 'Master', 'Doctor')

df5$`Claim Reason` <- factor(df5$`Claim Reason`, levels=mylevels)
df5$Education <- factor(df5$Education, levels=mylevels2)

breaks <- c(seq(0,max(df5$n), by=200))

ggplot(df5, aes(x=`Claim Reason`, y=Education, fill=n)) +
  geom_tile(color='black') +
  geom_text(aes(label=comma(n))) +
  coord_equal(ratio=1) +
  labs(title="Claims by Education Type by Claim Reason",
       x="Claim Reason",
       y="Education Type",
       fill="Claim Count") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5))+
  scale_y_discrete(limits=rev(levels(df5$Education))) +
  scale_fill_continuous(low="white", high="purple", breaks=breaks, labels=comma) +
  guides(fill=guide_legend(reverse=TRUE, override.aes=list(colour="black")))

Tab 5

The pie chart represents the number of claims that were filed based on each claim reason type. The chart makes a distinction between each state the claim was filed in. It provides both a count of claims and the percentage each state contributes to each claim reason.

The pie chart illustrates that each state contributes a comparable portion of the total for each claim reason. For example, Missouri makes up approximately 34.5% of each claim reason (collision, hail, scratch/dent, other). Similarly, Iowa makes up approximately 28.5%, Nebraska approximately 18.5%, Oklahoma approximately 9.5%, and Kansas approximately 8.5%. These similar distributions indicate that auto insurance claims are alike across the five states and that no state is more prone to one claim reason than another. Overall, Missouri generates the most claims, with Iowa, Nebraska, Oklahoma, and Kansas following behind.

state_df <- df %>%
  group_by(`Claim Reason`, State) %>%
  summarise(n=length(State), .groups='keep') %>%
  group_by(`Claim Reason`) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()

# Four nested rings, one per claim reason, drawn from the outside in.
# Each trace already receives its own subset through data=, so values=~n is enough.
plot_ly(hole=0.7) %>%
  layout(title="Claims Count By Claim Reason By State") %>%
  add_trace(data=state_df[state_df$Claim.Reason=="Collision",],
            labels=~State,
            values=~n,
            type="pie",
            textposition="inside",
            hovertemplate="Reason: Collision<br>State: %{label}<br>Percent: %{percent}<br>Claim Count: %{value}<extra></extra>") %>%
  add_trace(data=state_df[state_df$Claim.Reason=="Hail",],
            labels=~State,
            values=~n,
            type="pie",
            textposition="inside",
            hovertemplate="Reason: Hail<br>State: %{label}<br>Percent: %{percent}<br>Claim Count: %{value}<extra></extra>",
            domain=list(
              x=c(0.16,0.84),
              y=c(0.16,0.84))) %>%
  add_trace(data=state_df[state_df$Claim.Reason=="Scratch/Dent",],
            labels=~State,
            values=~n,
            type="pie",
            textposition="inside",
            hovertemplate="Reason: Scratch/Dent<br>State: %{label}<br>Percent: %{percent}<br>Claim Count: %{value}<extra></extra>",
            domain=list(
              x=c(0.27,0.73),
              y=c(0.27,0.73))) %>%
  add_trace(data=state_df[state_df$Claim.Reason=="Other",],
            labels=~State,
            values=~n,
            type="pie",
            textposition="inside",
            hovertemplate="Reason: Other<br>State: %{label}<br>Percent: %{percent}<br>Claim Count: %{value}<extra></extra>",
            domain=list(
              x=c(0.35,0.65),
              y=c(0.35,0.65)))
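
The claim that each state contributes a comparable share to every claim reason can be checked numerically by putting the shares side by side. The sketch below shows one possible way to do that in base R. The toy counts and the variable names are assumptions for illustration only and are not taken from the claims data.

```r
# Toy counts (placeholders only, NOT values from the claims data)
toy_state_counts <- data.frame(
  Claim.Reason = rep(c("Collision", "Hail"), each = 3),
  State        = rep(c("Iowa", "Kansas", "Missouri"), times = 2),
  n            = c(30, 10, 60, 16, 4, 30)
)

# Claim reason by state table of counts
state_tab <- xtabs(n ~ Claim.Reason + State, data = toy_state_counts)

# Each state's percentage share within a claim reason (rows sum to 100)
state_shares <- round(100 * prop.table(state_tab, margin = 1), 1)

# For each state, the gap between its largest and smallest share across reasons;
# small gaps mean the state mix is similar from one claim reason to the next
share_spread <- apply(state_shares, 2, function(x) max(x) - min(x))

state_shares
share_spread
```

With the real data, a table of the same shape could be built from the n column of state_df, and small values in share_spread would support the reading that the state mix barely changes across claim reasons.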

Tab 6

The final visual shows the number of claims filed for each claim reason type. The count distinguishes between gender and is represented through two separate bars per claim reason.

After viewing the visual, it is apparent that females make up the majority of filed auto insurance claims. The only instance where females are surpassed by males is when the claim reason is other. Even though females are consistently higher, the differences between claims for males and females are slim, especially when looking at the claim reasons of hail and scratch/dent. When looking at hail, females filed 1,464 claims while males filed 1,462, resulting in a small difference of 2 claims. For the scratch/dent claim reason, females filed 726 claims while males filed 706 claims, resulting in a minor difference of 20 claims. Therefore, with these small differences, there does not appear to be a strong argument that females are more likely to file claims than males.

reason_df <- df%>%
  group_by(Gender, `Claim Reason`) %>%
  summarise(n=length(Gender), .groups='keep') %>%
  data.frame()

ggplot(reason_df, aes(x=reorder(`Claim.Reason`, -n, sum), y=n, fill=Gender)) +
  geom_bar(stat="identity", position="dodge") +
  theme_light() +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_y_continuous(labels=comma) +
  labs(title="Total Claims by Reason by Gender", x="Claim Reason", y="Claim Count", fill="Gender") +
  scale_fill_brewer(palette="Set2") +
  geom_text(aes(label=scales::comma(n)), position=position_dodge(width=0.9), vjust=-0.4, size=4)
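
The size of the gender gap is easier to judge when the differences are expressed as shares. The sketch below uses only the four counts quoted in the text for hail and scratch/dent; the variable and column names are chosen here for illustration.

```r
# Claim counts by gender for the two reasons quoted in the text
gender_counts <- data.frame(
  Claim.Reason = c("Hail", "Scratch/Dent"),
  Female       = c(1464, 726),
  Male         = c(1462, 706)
)

# Absolute gap between female and male claim counts
gender_counts$Difference <- gender_counts$Female - gender_counts$Male

# Percentage of claims for each reason that were filed by females
gender_counts$FemaleSharePct <- round(
  100 * gender_counts$Female / (gender_counts$Female + gender_counts$Male), 1
)

gender_counts
```

In both cases the female share stays within about one percentage point of an even split, which is consistent with the conclusion that the gap is too small to suggest a real difference.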

Conclusion

After performing this analysis, the questions listed in the introduction can now be answered. Tab #1 gives insight as to whether certain claims are rejected more than others. Collisions appear to be the most commonly rejected claim reason, with 3,158 “No” responses. Following collisions, the order from highest number of claim rejections to lowest is hail, scratch/dent, and other.

Tab #3 shows whether there is a relationship between coverage type and claim dollar amounts. No matter the claim reason, premium coverage consistently has the highest average claim dollar amount. Extended coverage and basic coverage rank second and third, respectively.

Tab #5 investigates the distribution of states among each claim reason. The pie chart illustrates that each claim reason possesses a similar distribution of states, with Missouri having the highest claim count, followed by Iowa, Nebraska, Oklahoma, and Kansas.

Tab #2 answers the question about whether there is a relationship between coverage type and claim reason. Similar to the previous paragraph looking at state distributions, each coverage type possesses a similar distribution of claim reasons with collisions making up the majority of claims, followed by hail, scratch/dent, and other.

Finally, Tabs #4 and #6 look into whether certain education levels or genders increase the probability of someone filing an auto insurance claim. The heat map under Tab #4 indicates that those with bachelor’s degrees filed the most claims. From there, the next highest numbers of filed claims come from those in college, those in high school or below, those with master’s degrees, and those with doctorate degrees. Tab #6 shows females typically surpassing males in the number of claims filed. However, given that the margins are slim, there does not appear to be enough evidence to conclude that females consistently file more claims than males.