Part 1 - Introduction

Car insurance premiums vary from state to state in America. Why do some states have higher premiums than others? In this project, I wanted to study the association between driver records and car insurance premiums. For those planning to move to another state, it’s good to know the percentage of bad drivers and the likely car insurance premium before making the decision.

Part 2 - Data
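
The report does not show how the `drivers` data frame was created. The following is only a minimal sketch of one way it could be loaded, assuming the CSV comes from the repository listed under References and that the short column names seen in the summary output were assigned by position. The helper name `read_drivers` and the raw-file URL in the comment are assumptions, not taken from the original analysis.

```r
# Hypothetical helper: read the bad-drivers CSV and apply the short column
# names that appear in the summary output of this report.
read_drivers <- function(path) {
  # State is read as a factor, matching the summary() output in Part 3.
  drivers <- read.csv(path, stringsAsFactors = TRUE)
  # The renaming below is positional, so it assumes exactly eight columns.
  stopifnot(ncol(drivers) == 8)
  names(drivers) <- c("State", "Collisions", "Perc.Speeding", "Perc.Alcohol",
                      "Perc.Not.Distracted", "Perc.No.Pre.Accident",
                      "Insurance.Premium", "Insurance.Loss")
  drivers
}

# Assumed location of the raw file (not confirmed by the report):
# drivers <- read_drivers("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")
```

Renaming by position is fragile if the column order of the source file ever changes, so it is worth comparing the result against the summary statistics in Part 3 before relying on it.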

Part 3 - Exploratory data analysis

library(ggplot2)    # plotting
library(dplyr)      # provides the %>% pipe used below
library(DT)         # interactive tables
datatable(drivers)  # browse the full data set

First, we looked at the summary statistics for each variable in the dataset.

summary(drivers)
##         State      Collisions    Perc.Speeding    Perc.Alcohol  
##  Alabama   : 1   Min.   : 5.90   Min.   :13.00   Min.   :16.00  
##  Alaska    : 1   1st Qu.:12.75   1st Qu.:23.00   1st Qu.:28.00  
##  Arizona   : 1   Median :15.60   Median :34.00   Median :30.00  
##  Arkansas  : 1   Mean   :15.79   Mean   :31.73   Mean   :30.69  
##  California: 1   3rd Qu.:18.50   3rd Qu.:38.00   3rd Qu.:33.00  
##  Colorado  : 1   Max.   :23.90   Max.   :54.00   Max.   :44.00  
##  (Other)   :45                                                  
##  Perc.Not.Distracted Perc.No.Pre.Accident Insurance.Premium
##  Min.   : 10.00      Min.   : 76.00       Min.   : 642.0   
##  1st Qu.: 83.00      1st Qu.: 83.50       1st Qu.: 768.4   
##  Median : 88.00      Median : 88.00       Median : 859.0   
##  Mean   : 85.92      Mean   : 88.73       Mean   : 887.0   
##  3rd Qu.: 95.00      3rd Qu.: 95.00       3rd Qu.:1007.9   
##  Max.   :100.00      Max.   :100.00       Max.   :1301.5   
##                                                            
##  Insurance.Loss  
##  Min.   : 82.75  
##  1st Qu.:114.64  
##  Median :136.05  
##  Mean   :134.49  
##  3rd Qu.:151.87  
##  Max.   :194.78  
## 

Then we used a bar plot to show car insurance premiums in all the states, ranked from highest to lowest.

drivers %>% ggplot(aes(x = reorder(State, -Insurance.Premium), y = Insurance.Premium, fill = Insurance.Premium)) +
  geom_col() +              # bar height is taken directly from Insurance.Premium
  guides(fill = "none") +   # hide the redundant fill legend
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  xlab("State") + ylab("Car Insurance Premium") +
  ggtitle("Car Insurance Premium by State")

We also used a second bar plot to show insurance losses in all the states, ranked from highest to lowest.

drivers %>% ggplot(aes(x = reorder(State, -Insurance.Loss), y = Insurance.Loss, fill = Insurance.Loss)) +
  geom_col() +              # bar height is taken directly from Insurance.Loss
  guides(fill = "none") +   # hide the redundant fill legend
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  xlab("State") + ylab("Insurance Losses") +
  ggtitle("Insurance Losses by State")

Idaho has both the lowest insurance loss and the lowest premium. New Jersey, on the other hand, has the highest premium of all the states but not the highest losses; in fact, it ranks only 6th in losses.
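
Rankings like these can be checked directly rather than read off the plots. The sketch below assumes the data frame has a `State` column plus the numeric columns used in this report; the helper name `rank_of_state` is hypothetical.

```r
# Hypothetical helper: position of one state when `column` is sorted from
# highest to lowest (1 = highest value). Tied values share the smallest rank.
rank_of_state <- function(data, column, state) {
  ranks <- rank(-data[[column]], ties.method = "min")
  ranks[data$State == state]
}

# Usage with the data frame from this report (not run here):
# rank_of_state(drivers, "Insurance.Loss", "New Jersey")
# rank_of_state(drivers, "Insurance.Premium", "New Jersey")
```

Negating the column before ranking means rank 1 goes to the largest value, which matches the highest-to-lowest ordering of the bar plots.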

Finally, we looked at the relationship between insurance losses and premiums in a scatterplot.

drivers %>% ggplot(aes(x = Insurance.Loss, y = Insurance.Premium, color = Insurance.Premium, size = Insurance.Loss)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  xlab("Insurance Losses") + ylab("Car Insurance Premium") +
  ggtitle("Car Insurance Premium by Insurance Losses")

The points trend upward toward the upper right corner, which indicates that states with higher losses tend to have higher premiums. However, the points don’t exactly follow a straight line.

Part 4 - Inference

For this dataset, we have 51 observations, which is a sample size larger than 30, and the observations are independent. However, insurance premiums do not follow a normal distribution; they are right skewed. The conditions for inference have therefore not all been met.

hist(drivers$Insurance.Premium)
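
The skew visible in a histogram can also be summarized numerically. The summary statistics in Part 3 already hint at it: the mean premium (887.0) is above the median (859.0). The helper below is a minimal sketch, with `skew_summary` as a hypothetical name, that reports the mean-median gap and the moment coefficient of skewness; positive values of both point to a right-skewed distribution.

```r
# Hypothetical helper: two simple indicators of skew for a numeric vector.
skew_summary <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  centered <- x - mean(x)
  c(
    mean_minus_median = mean(x) - median(x),
    # Moment coefficient of skewness: third central moment divided by the
    # second central moment raised to the power 3/2.
    skewness = mean(centered^3) / mean(centered^2)^1.5
  )
}

# Usage with the data frame from this report (not run here):
# skew_summary(drivers$Insurance.Premium)
```

For a linear regression, the normality condition applies to the residuals rather than to the response itself, so the same helper could also be applied to `residuals(m_loss)` once the model has been fitted.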

We then fit a simple linear regression of insurance premium on insurance loss:

m_loss <- lm(Insurance.Premium ~ Insurance.Loss, data = drivers)
summary(m_loss)
## 
## Call:
## lm(formula = Insurance.Premium ~ Insurance.Loss, data = drivers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -213.33  -96.75  -40.11  112.24  379.97 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    285.3251   109.6689   2.602   0.0122 *  
## Insurance.Loss   4.4733     0.8021   5.577 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 140.9 on 49 degrees of freedom
## Multiple R-squared:  0.3883, Adjusted R-squared:  0.3758 
## F-statistic:  31.1 on 1 and 49 DF,  p-value: 1.043e-06
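
As a rough guide to reading this output, the sketch below recomputes a few quantities from the reported numbers alone: the t value for the slope, its two-sided p-value, an approximate 95% confidence interval, and the correlation implied by R-squared. All inputs are copied from the output above. Because the conditions for inference were not fully met, the interval and p-value should be treated as approximate.

```r
# Values copied from the summary(m_loss) output above.
estimate  <- 4.4733   # slope for Insurance.Loss
std_error <- 0.8021   # standard error of the slope
df_resid  <- 49       # residual degrees of freedom
r_squared <- 0.3883   # multiple R-squared

# t value: estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 49 degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope.
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

# With a single predictor, the correlation is the square root of R-squared,
# carrying the sign of the slope.
r <- sign(estimate) * sqrt(r_squared)

print(round(c(t = t_value, lower = ci_95[1], upper = ci_95[2], r = r), 3))
print(signif(p_value, 3))
```

With the fitted model available, `confint(m_loss)` and `cor(drivers$Insurance.Loss, drivers$Insurance.Premium)` would give the same quantities from the full-precision estimates.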

The fitted model gives the following formula for predicting the insurance premium from the insurance loss:

\[\widehat{\text{premium}} = 285.33 + 4.47 \times \text{loss}\]
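
To make the formula concrete, the sketch below applies the reported coefficients to a few loss values taken from the summary statistics in Part 3. The helper name `predict_premium` is hypothetical. The slope can be read as follows: a state with one more dollar of insurance loss per insured driver is predicted to have a premium about $4.47 higher.

```r
# Coefficients copied from the summary(m_loss) output in this report.
intercept <- 285.3251
slope     <- 4.4733

# Hypothetical helper: predicted premium for a given insurance loss.
predict_premium <- function(loss) intercept + slope * loss

# Prediction at the mean loss (134.49, from the summary statistics).
# A least-squares line passes through the point of means, so this should
# land close to the mean premium of 887.0, up to rounding of the inputs.
print(predict_premium(134.49))

# Predictions at the first and third quartiles of loss (114.64 and 151.87).
print(predict_premium(c(114.64, 151.87)))
```

With the fitted model available, `predict(m_loss, newdata = data.frame(Insurance.Loss = 134.49))` would return the same prediction without copying coefficients by hand.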

Part 5 - Conclusion

From this study, I conclude that there appears to be an association between car insurance losses and insurance premiums, but I would not say there is a causal link between the two variables. We do not know how the car insurance premiums were calculated or how much weight each factor carries. Until further analysis can be done, the only firm observation is that Idaho, the state with the lowest insurance losses, also has the lowest car insurance premium.

References

https://github.com/fivethirtyeight/data/tree/master/bad-drivers