Car insurance premiums vary widely across U.S. states. Why do some states have higher premiums than others? In this project, I study the association between driver records and car insurance premiums. If you are planning to move to another state, it is useful to know the share of bad drivers there and your likely car insurance premium before you decide.
The data were collected from the National Highway Traffic Safety Administration and the National Association of Insurance Commissioners and compiled by FiveThirtyEight.
There are 51 cases (observations) in the data set, including the District of Columbia. Each case represents a state in the United States. I will study the Insurance Loss and Insurance Premium variables. Insurance Loss is the explanatory (independent) variable and Insurance Premium is the response variable; both are quantitative.
This is an observational study. I will draw my conclusions based on analyzing the existing data.
The population of interest is drivers in the United States. Because the data cover all 50 states and the District of Columbia, the results describe the country as a whole, but only at the state level; they should not be read as statements about individual drivers.
Because this is an observational study, the data cannot by themselves establish a causal link between Insurance Loss and Insurance Premium; other factors may have been taken into account when the premiums were priced.
library(ggplot2)
library(dplyr)
library(DT)
datatable(drivers)
First we looked at the summary statistics for characteristics of bad drivers.
summary(drivers)
## State Collisions Perc.Speeding Perc.Alcohol
## Alabama : 1 Min. : 5.90 Min. :13.00 Min. :16.00
## Alaska : 1 1st Qu.:12.75 1st Qu.:23.00 1st Qu.:28.00
## Arizona : 1 Median :15.60 Median :34.00 Median :30.00
## Arkansas : 1 Mean :15.79 Mean :31.73 Mean :30.69
## California: 1 3rd Qu.:18.50 3rd Qu.:38.00 3rd Qu.:33.00
## Colorado : 1 Max. :23.90 Max. :54.00 Max. :44.00
## (Other) :45
## Perc.Not.Distracted Perc.No.Pre.Accident Insurance.Premium
## Min. : 10.00 Min. : 76.00 Min. : 642.0
## 1st Qu.: 83.00 1st Qu.: 83.50 1st Qu.: 768.4
## Median : 88.00 Median : 88.00 Median : 859.0
## Mean : 85.92 Mean : 88.73 Mean : 887.0
## 3rd Qu.: 95.00 3rd Qu.: 95.00 3rd Qu.:1007.9
## Max. :100.00 Max. :100.00 Max. :1301.5
##
## Insurance.Loss
## Min. : 82.75
## 1st Qu.:114.64
## Median :136.05
## Mean :134.49
## 3rd Qu.:151.87
## Max. :194.78
##
Then we looked at a barplot of car insurance premiums in all the states, ranked from highest to lowest.
# Bars are ordered by descending premium; the fill shade also encodes the premium
drivers %>% ggplot(aes(x = reorder(State, -Insurance.Premium), y = Insurance.Premium, fill = Insurance.Premium)) +
geom_col() +
guides(fill = "none") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
xlab("States") + ylab("Car Insurance Premium") +
ggtitle("Car Insurance Premium by State")
We also looked at a second barplot of insurance losses in all the states, ranked from highest to lowest.
# Bars are ordered by descending loss; the fill shade also encodes the loss
drivers %>% ggplot(aes(x = reorder(State, -Insurance.Loss), y = Insurance.Loss, fill = Insurance.Loss)) +
geom_col() +
guides(fill = "none") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
xlab("States") + ylab("Insurance Losses") +
ggtitle("Insurance Losses by State")
Idaho has both the lowest insurance loss and the lowest premium. New Jersey, on the other hand, has the highest premium of all the states but not the highest losses; it ranks only 6th on losses.
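The rankings quoted above can be checked directly instead of being read off the barplots. The following is a minimal sketch: `rank_states` is a hypothetical helper name, and it assumes the data frame has the `State`, `Insurance.Premium`, and `Insurance.Loss` columns shown in the summary output.

```r
# Hypothetical helper: rank states from highest (rank 1) to lowest
# on both premium and loss, so the two rankings can be compared.
rank_states <- function(df) {
  out <- data.frame(
    State        = as.character(df$State),
    premium_rank = rank(-df$Insurance.Premium, ties.method = "min"),
    loss_rank    = rank(-df$Insurance.Loss, ties.method = "min"),
    stringsAsFactors = FALSE
  )
  # Sort by premium rank so the most expensive state comes first
  out[order(out$premium_rank), ]
}

# Only runs if the `drivers` data frame from this report is loaded
if (exists("drivers")) {
  ranks <- rank_states(drivers)
  print(head(ranks))  # states with the highest premiums
  print(tail(ranks))  # states with the lowest premiums
}
```

A state whose `premium_rank` and `loss_rank` differ sharply is one where losses alone do not account for the premium.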
Finally, we looked at the relationship between insurance losses and premiums from a scatterplot.
# Each point is a state; color encodes the premium and point size encodes the loss
drivers %>% ggplot(aes(x = Insurance.Loss, y = Insurance.Premium, color = Insurance.Premium, size = Insurance.Loss)) +
geom_point() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
xlab("Insurance Losses") + ylab("Car Insurance Premium") +
ggtitle("Car Insurance Premium by Insurance Losses")
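The points trend upward toward the upper right corner, which indicates that higher losses are associated with higher premiums. However, the points scatter considerably around any straight line, so the relationship is only moderately strong.
This dataset has more than 30 observations (n = 51), and the observations can reasonably be treated as independent. However, the distribution of insurance premiums is not normal; it is right-skewed, as the histogram below shows. The conditions for inference are therefore not fully met, so the p-values from the regression should be interpreted with caution.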
hist(drivers$Insurance.Premium)
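The right skew can also be quantified from the quartiles already reported by `summary(drivers)`. The sketch below uses the Bowley (quartile) skewness coefficient, which is my choice of measure rather than something the original analysis used; the four numbers are copied from the summary output for `Insurance.Premium`.

```r
# Quartiles and mean of Insurance.Premium, copied from the summary() output
q1  <- 768.4
med <- 859.0
q3  <- 1007.9
avg <- 887.0

# Bowley (quartile) skewness: 0 for a symmetric distribution,
# positive when the upper half is more spread out than the lower half
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

# A mean above the median is a second, rougher sign of right skew
mean_minus_median <- avg - med

print(round(bowley, 3))
print(mean_minus_median)
```

Both quantities come out positive (the coefficient is about 0.24, and the mean exceeds the median by 28), which agrees with the right skew seen in the histogram.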
Running a linear regression model:
m_loss <- lm(Insurance.Premium ~ Insurance.Loss, data = drivers)
summary(m_loss)
##
## Call:
## lm(formula = Insurance.Premium ~ Insurance.Loss, data = drivers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.33 -96.75 -40.11 112.24 379.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 285.3251 109.6689 2.602 0.0122 *
## Insurance.Loss 4.4733 0.8021 5.577 1.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140.9 on 49 degrees of freedom
## Multiple R-squared: 0.3883, Adjusted R-squared: 0.3758
## F-statistic: 31.1 on 1 and 49 DF, p-value: 1.043e-06
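Two further quantities can be derived from the output above: a 95% confidence interval for the slope and the correlation coefficient. This is a sketch that copies the printed estimates rather than refitting the model, and the interval relies on the usual regression assumptions, which are only partly satisfied here.

```r
# Values copied from the summary(m_loss) output
slope     <- 4.4733
se_slope  <- 0.8021
df_resid  <- 49
r_squared <- 0.3883

# 95% confidence interval for the slope: estimate +/- t* x standard error
t_star   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_star * se_slope

# Correlation coefficient: square root of R-squared, with the sign of the slope
r <- sign(slope) * sqrt(r_squared)

print(round(slope_ci, 2))
print(round(r, 3))
```

The interval runs from roughly 2.9 to 6.1 and excludes zero, and the correlation is about 0.62, a moderate positive association. If `m_loss` is available in the session, `confint(m_loss)` gives the same interval without copying numbers by hand.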
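The fitted linear regression model for predicting insurance premium from insurance loss is:
\[\widehat{\text{premium}} = 285.33 + 4.47 \times \text{loss}\]
Each one-unit increase in insurance loss is associated with an increase of about 4.47 in the predicted premium, and the model explains about 39% of the variation in premiums (R-squared = 0.3883).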
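As a worked example, the fitted equation can be evaluated at a few loss values. The sketch below hard-codes the coefficients from the `summary(m_loss)` output; `predict_premium` is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from the summary(m_loss) output
intercept <- 285.3251
slope     <- 4.4733

# Hypothetical helper: predicted premium for a given insurance loss
predict_premium <- function(loss) {
  intercept + slope * loss
}

# Prediction at the mean loss reported by summary(drivers)
print(predict_premium(134.49))

# Predictions at the smallest and largest observed losses;
# values outside this range would be extrapolation
print(predict_premium(c(82.75, 194.78)))
```

At the mean loss the prediction is about 886.9, essentially the mean premium of 887.0, as expected because a least-squares line passes through the point of means. At the lowest observed loss (Idaho's 82.75) the prediction is about 655, close to Idaho's actual premium of 642.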
From this study, I conclude that there appears to be a positive association between car insurance losses and insurance premiums, but I would not claim a causal link between the two variables. We do not know how the premiums were calculated or how much weight each factor carries. Until further analysis is done, the only firm statement is descriptive: Idaho has both the lowest insurance losses and the lowest car insurance premium.