Introduction

The data set, Auto Insurance Claims, details information for customers belonging to the following Midwest states: Missouri, Iowa, Nebraska, Oklahoma, and Kansas. The data was accumulated from January 1, 2011 to February 28, 2011. In addition to providing information about the customers themselves, the data set holds information about customer vehicles, customer insurance, and customer claims. As a result, we are able to analyze relationships and distributions of various factors for a greater understanding of vehicle insurance claims and the population of people utilizing auto insurance.

Descriptive Statistics

Auto_Insurance_Claims_Sample <- read.csv("C:/Users/eshinh/DataVizFiles/Auto_Insurance_Claims_Sample.csv")
df = data.frame(Auto_Insurance_Claims_Sample)
summary(df)
##    Customer           Country           State.Code           State          
##  Length:9134        Length:9134        Length:9134        Length:9134       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Claim.Amount      Response           Coverage          Education        
##  Min.   : 189.8   Length:9134        Length:9134        Length:9134       
##  1st Qu.: 399.4   Class :character   Class :character   Class :character  
##  Median : 578.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 800.5                                                           
##  3rd Qu.: 896.2                                                           
##  Max.   :8332.5                                                           
##  Effective.To.Date  EmploymentStatus      Gender              Income     
##  Length:9134        Length:9134        Length:9134        Min.   :    0  
##  Class :character   Class :character   Class :character   1st Qu.:    0  
##  Mode  :character   Mode  :character   Mode  :character   Median :33890  
##                                                           Mean   :37657  
##                                                           3rd Qu.:62320  
##                                                           Max.   :99981  
##  Location.Code      Marital.Status     Monthly.Premium.Auto
##  Length:9134        Length:9134        Min.   : 61.00      
##  Class :character   Class :character   1st Qu.: 68.00      
##  Mode  :character   Mode  :character   Median : 83.00      
##                                        Mean   : 93.22      
##                                        3rd Qu.:109.00      
##                                        Max.   :298.00      
##  Months.Since.Last.Claim Months.Since.Policy.Inception
##  Min.   : 0.0            Min.   : 0.00                
##  1st Qu.: 6.0            1st Qu.:24.00                
##  Median :14.0            Median :48.00                
##  Mean   :15.1            Mean   :48.06                
##  3rd Qu.:23.0            3rd Qu.:71.00                
##  Max.   :35.0            Max.   :99.00                
##  Number.of.Open.Complaints Number.of.Policies Policy.Type       
##  Min.   :0.0000            Min.   :1.000      Length:9134       
##  1st Qu.:0.0000            1st Qu.:1.000      Class :character  
##  Median :0.0000            Median :2.000      Mode  :character  
##  Mean   :0.3844            Mean   :2.966                        
##  3rd Qu.:0.0000            3rd Qu.:4.000                        
##  Max.   :5.0000            Max.   :9.000                        
##     Policy          Claim.Reason       Sales.Channel      Total.Claim.Amount
##  Length:9134        Length:9134        Length:9134        Min.   :   0.099  
##  Class :character   Class :character   Class :character   1st Qu.: 272.258  
##  Mode  :character   Mode  :character   Mode  :character   Median : 383.945  
##                                                           Mean   : 434.089  
##                                                           3rd Qu.: 547.515  
##                                                           Max.   :2893.240  
##  Vehicle.Class      Vehicle.Size      
##  Length:9134        Length:9134       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
nrow(df)
## [1] 9134

The data is made up of 18 qualitative variables and 8 quantitative variables. There are no missing values in the data frame. There are 9,134 rows of data.

Correlation Predictions

Looking at the quantitative data, there is expected to be an inverse relationship between income and total claim amount, a direct relationship between monthly insurance premiums and total claim amount, and a direct relationship between months since last claim and total claim amount.

cor(df$Income, df$Total.Claim.Amount)
## [1] -0.3552543

Correlation between income and total claim amount.

cor(df$Total.Claim.Amount, df$Monthly.Premium.Auto)
## [1] 0.6320167

Correlation between monthly premium and total claim amount.

cor(df$Total.Claim.Amount, df$Months.Since.Last.Claim)
## [1] 0.007562974

Correlation between months since last claim and total claim amount.

Correlation Results

The correlation coefficients from each of these predictions exemplify that there is a weak, negative relationship between income and total claim amount, there is a moderate, positive relationship between monthly premiums and total claim amount, and there is little to no relationship between months since last claim and total claim amount.

Visualizations

Below you will find graphs showcasing the relationships between both qualitative and quantitative data from the dataset.

Tab 1

library(plyr)
library(dplyr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(lubridate)

df_income = df %>%
  select(State, Income, Vehicle.Class) %>%
  group_by(Vehicle.Class, State) %>%
  summarize(avg = mean(Income), .groups = 'keep') %>%
  data.frame()

mylevels = c('Two-Door Car', 'Four-Door Car', 'Sports Car', 'Luxury Car',
             'SUV', 'Luxury SUV')
df_income$Vehicle.Class = factor(df_income$Vehicle.Class, levels = mylevels)
ggplot(df_income, aes(x = Vehicle.Class, y = State, fill = avg)) +
  geom_tile(color = "black") +
  geom_text(aes(label = dollar(round(avg)))) +
  coord_equal(ratio = 1) +
  labs(title = "Heatmap: Average Income by Vehicle Class and State",
       x = "Vehicle Class",
       y = "States",
       fill = "Income") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_continuous(low = "white", high = "red", labels = dollar) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(color = "black")))

This heatmap provides insight into the auto insurance customer base in the data set by showcasing the average income distribution across states and vehicle classes. The graph highlights two key pieces of data: customers with the highest average income own a luxury car from Nebraska, and customers with the lowest average income own a luxury car from Kansas. This may indicate that Kansas has a lower cost of living if lower income customers are owning luxury cars. Simultaneously, this shows that Nebraska has a higher cost of living among Midwest states. The left side of the heatmap displays darker shades of red; this suggests that higher income customers tend to own two or four-door cars across all Midwest states. On the other hand, the right side of the heatmap displays lighter shades of red, implying that lower income customers own luxury SUVs, particularly customers who are residents of Oklahoma, Missouri, or Kansas.

Tab 2

df_states = count(df, State)
df_states = df_states[order(df_states$n, decreasing = TRUE),]
states = df_states$State

new_df = df %>%
  filter(State %in% states) %>%
  select(Coverage, State) %>%
  group_by(State, Coverage) %>%
  summarize(n = length(State), .groups = 'keep') %>%
  data.frame()

agg_tot = new_df %>%
  select(State, n) %>%
  group_by(State) %>%
  summarize(tot = sum(n), .groups = 'keep') %>%
  data.frame()

max_y = round_any(max(agg_tot$tot), 1000, ceiling)
ggplot(new_df, aes(x = reorder(State, n, sum), y = n, fill = Coverage)) + 
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  coord_flip() +
  labs(title = "Auto Insurance Claim Count by State", x = "", y = "Auto Insurance Claim Count", fill = "Coverage") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Set1") +
  geom_text(data = agg_tot, aes(x = State, y = tot, label = scales::comma(tot), fill = NULL), hjust = -0.1, size = 4) +
  scale_y_continuous(labels = comma, limits = c(0, max_y))

The stacked horizontal bar chart presents the amount of vehicle insurance claims made by each state and further identifies the coverage breakdown of the claims. Within the Midwest states, Missouri customers had the most claims and Kansas customers had the least amount of claims between 1/1/2011 and 2/28/2011. This difference in claims can be attributed to the lack of traffic density in Kansas. It is clear that there is a trend regarding the proportion of coverage plans, with a majority of customers making claims in each Midwest state possessing basic coverage and a small amount of customers making claims possessing premium coverage. In addition to basic plans being more affordable and accessible, people within Midwest states opt for basic coverage because rural areas tend to have a lower cost of living, less accidents, and lower traffic density. However, it is clear that basic plans are not providing enough coverage given the amount of claims made by those customers.

Tab 3

df_claims = df %>%
  select(Claim.Reason, State, Claim.Amount) %>%
  group_by(State, Claim.Reason) %>%
  summarize(tot = sum(Claim.Amount), .groups = 'keep') %>%
  data.frame()

state_order = factor(df_claims$State, level = c('Missouri', 'Iowa', 'Nebraska', 'Oklahoma', 'Kansas'))
claim_order = factor(df_claims$Claim.Reason, level = c('Scratch/Dent', 'Hail', 'Collision', 'Other'))
ggplot(df_claims, aes(x = claim_order, y = tot, group = State)) +
  geom_line(aes(color = State), size = 3) +
  labs(title = "Auto Insurance Claim Amount by Reason and by State", x = "Claim Reason", y = "Claim Amount") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "white") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_color_brewer(palette = "Set1")

The line plot demonstrates the total amount of money insurance companies have paid to policyholders based on the reason for the claim. The plot further shows the difference in these claim amounts by state. There is a pattern across all Midwest states; insurance companies pay the most for collision claims, followed by hail, scratches and dents, and the least amount for claims falling in the “other” category. Similarly, the graph highlights how insurance companies pay Missouri policyholders the most for each claim category and pay Kansas policyholders the least for each. This can be explained by the difference in legal environments for each state. In Missouri, liability laws and damage awards are more favorable to plaintiffs. Kansas, on the other hand, has more conservative policies when it comes to handling claims. The order in which states are presented in this graph from most expensive claim amounts to least expensive follows that of the previous graph, which indicates that the number of claims made by each state also plays a role in the claim amount since Missouri had the most claims and Kansas had the least amount of claims between 1/1/2011 and 2/28/2011.

Tab 4

df_policy = df %>%
  select(Policy, Coverage) %>%
  group_by(Policy, Coverage) %>%
  summarize(n=length(Policy), .groups = 'keep') %>%
  group_by(Policy) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()

ggplot(data = df_policy, aes(x = "", y = n, fill = Coverage)) +
  geom_bar(stat = "identity", position = "fill") +
  coord_polar(theta = "y", start = 0) +
  labs(fill = "Coverage Type", x = NULL, y = NULL, title = "Auto Insurance Claims by Policy and by Coverage") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_wrap(~Policy, ncol = 3, nrow = 3) + 
  scale_fill_brewer(palette = "PuRd") +
  geom_text(aes(x = 1.8, label = paste0(percent_of_total, "%")),
            size = 2.5,
            position = position_fill(vjust = 0.5))

The pie charts exhibit the distributions of plan coverage types across different auto insurance policies. For each policy and level, more than half of the customers who made claims held a basic plan, approximately one-third of customers held an extended plan, and about one-tenth of customers held a premium plan. These results stay consistent with that of the “Auto Insurance Claim Count by State” stacked bar chart, which revealed that majority of customers with claims had basic plans, some had extended plans, and fewer had premium plans. While the coverage distributions remain similar across each policy, more customers with claims under the special level 3 policy had basic coverage and less had premium coverage. Special level 3 auto insurance policies often come with higher premiums and offer a higher tier of coverage. As a result, customers may want to save money by choosing basic coverage, or they may feel as though they do not need higher coverage given that the policy is high tier by default. However, the pie charts reveal that premium coverage resulted in less claims for policyholders of special level 3.

Tab 5

dates_converted = format(mdy(df$Effective.To.Date), "%m/%d/%Y")
df$Effective.To.Date = dates_converted

df_dates = df %>%
  select(Effective.To.Date) %>%
  mutate(months = months(mdy(Effective.To.Date), abbreviate = TRUE),
         days = day(mdy(Effective.To.Date))) %>%
  group_by(months, days) %>%
  summarize(n = length(Effective.To.Date), .groups = 'keep') %>%
  data.frame()

df_dates$days = factor(df_dates$days)

df_dates$months = factor(df_dates$months, levels = c('Jan', 'Feb'))

ggplot(df_dates, aes(x = days, y = n, fill = months)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_y_continuous(labels = comma) +
  labs(title = "Multiple Bar Charts - Total Auto Insurance Claims by Day and Month", 
       x = "Day", 
       y = "Claim Count",
       fill = "Month") + 
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~months, ncol = 2, nrow = 1) +
  scale_x_discrete(breaks = df_dates$days[seq(1,length(df_dates$days), by = 2)])

These multiple bar charts present how the number of auto insurance claims rose and fell over the course of each day in January and February of 2011. On January 4th, the number of claims fell after people had finished traveling from the New Year holiday. It appears that Mondays tended to cause a spike in January claims; this is seen on the dates January 10th, 17th, and 31st. The spike on January 26th and 27th is attributed to a major winter storm that created hazardous weather conditions in the Midwest.

The month of February appeared to have a more consistent number of auto insurance claims each day. The only major spike seen in the graph was on February 14th: Valentine’s Day. The number of claims fell again after the holiday, similar to what was seen after New Year’s weekend in the previous month.

Conclusion

The visualizations created to describe and analyze the auto insurance claims data set show how different factors affect insurance claims, such as coverage type, location, claim reason, day of the month, and policy type. The graphs ultimately revealed that Midwest policyholders with the following attributes have more auto insurance claims: Missouri residents, basic coverage, and/or driving during significant dates.