library(dplyr)
library(DT)
library(tidyr)
library(ggplot2)
library(plotly)
Which state has the worst drivers? Is there any correlation between the number of accidents and premiums? I chose this topic because I had a high auto insurance premium for more than two years, but after I requested a new quote this year, my premium dropped considerably. I was surprised, because as a new driver over the past two years I have had a couple of accidents and went to court several times. I would like to know how fatal accidents and the losses paid out by insurance companies relate to differences in premiums. I also found out that bad drivers in your state could affect your premiums as well.
We know that car insurance premiums vary from driver to driver, and there are many variables behind them, such as the driver’s age and driving record. We also pay higher or lower premiums depending on the state in which we live. A state’s driving record may help predict the state’s average insurance premium.
In this project, I will study auto insurance data from the 50 states and the District of Columbia, select two variables (fatal accidents and losses) from the data, and use them to predict the average auto insurance premium and compare it with the actual data.
The data is collected from https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv
The data set contains the following for each of the 50 states and the District of Columbia: the number of drivers involved in fatal collisions (per billion miles), the conditions at the time of these collisions, the average car insurance premium (in US dollars), and the losses incurred by insurance companies for collisions per insured driver (in US dollars).
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
# load data
data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv", header = TRUE, stringsAsFactors = FALSE)
head(data)
dim(data)
## [1] 51 8
There are 51 cases, one for each of the 50 states plus one for the District of Columbia, with 8 variables.
Our response variable for this project is Car Insurance Premiums ($) and the two chosen explanatory variables are:
Number of drivers involved in fatal collisions per billion miles.
Losses incurred by insurance companies for collisions per insured driver ($).
The study is observational: the data is collected based on what is seen, without any interference.
The data represents the entire population of U.S. drivers. Therefore, the data is generalizable to the entire U.S. population.
The collision data includes only collisions that involve a fatality, whereas the losses incurred by insurance companies per insured driver cover all types of collisions, not just fatal ones. Also, differences in insurance premiums from state to state are influenced by other factors, such as vehicle theft and natural disasters.
To start the analysis, we remove unnecessary columns from the original data and rename the remaining ones to facilitate analysis. This leaves a table with four columns. Here are the names of the columns and their meanings:
state: The name of the state represented by the data.
fatal_accident: The number of drivers involved in fatal collisions per billion miles.
losses: The losses ($) incurred by insurance companies for collisions, per insured driver.
insurance_premiums: The average car insurance premium in the state ($).
##Retrieve names of columns
names(data)
##Select only required columns
sumdata <- data.frame(data$State, data$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles, data$Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...., data$Car.Insurance.Premiums....)
Final Data
##Rename required columns
names(sumdata) <- c("state", "fatal_accident", "losses", "insurance_premiums")
datatable(sumdata, options = list(pageLength = 5))
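The select-and-rename step above can also be written as a single lookup from new names to original names. The sketch below is only an illustration: `raw` is a hypothetical two-row stand-in for the downloaded data (the numbers are made up), and `keep` and `sumdata_alt` are names introduced here, not part of the original analysis.

```r
# Hypothetical stand-in for the raw data, using the column names that
# read.csv() generates for this file. The values are made up.
raw <- data.frame(
  State = c("A", "B"),
  Number.of.drivers.involved.in.fatal.collisions.per.billion.miles = c(10, 20),
  Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver.... = c(100, 150),
  Car.Insurance.Premiums.... = c(800, 900),
  Unused.column = c(1, 2),
  stringsAsFactors = FALSE
)

# Map each short name (left) to the original column name (right)
keep <- c(
  state = "State",
  fatal_accident = "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles",
  losses = "Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....",
  insurance_premiums = "Car.Insurance.Premiums...."
)

# Select the four columns by name, then apply the short names in one step
sumdata_alt <- setNames(raw[, unname(keep)], names(keep))
print(names(sumdata_alt))
```

Keeping the mapping in one place makes it easier to see which original column each short name refers to.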
Summary data
summary(sumdata)
## state fatal_accident losses insurance_premiums
## Length:51 Min. : 5.90 Min. : 82.75 Min. : 642.0
## Class :character 1st Qu.:12.75 1st Qu.:114.64 1st Qu.: 768.4
## Mode :character Median :15.60 Median :136.05 Median : 859.0
## Mean :15.79 Mean :134.49 Mean : 887.0
## 3rd Qu.:18.50 3rd Qu.:151.87 3rd Qu.:1007.9
## Max. :23.90 Max. :194.78 Max. :1301.5
Visual representation of data
ggplot(sumdata,aes(x=reorder(state,fatal_accident), y=fatal_accident) )+
geom_bar(stat = "identity")+
xlab("States")+
ylab("Drivers in fatal collisions (per billion miles)")+
ggtitle("Drivers involved in fatal collisions per billion miles, by state")+
theme(axis.text.x=element_text(angle=90,hjust=0.2,vjust=0.2))
ggplot(sumdata,aes(x=reorder(state,losses), y=losses) )+
geom_bar(stat = "identity")+
xlab("States")+
ylab("Amount")+
ggtitle("Loss Incurred by Insurance Company")+
theme(axis.text.x=element_text(angle=90,hjust=0.2,vjust=0.2))
Linear model
m1 <- lm(insurance_premiums ~ fatal_accident, data = sumdata)
summary(m1)
##
## Call:
## lm(formula = insurance_premiums ~ fatal_accident, data = sumdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249.23 -136.43 -22.29 133.45 435.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1023.354 98.748 10.363 6.08e-14 ***
## fatal_accident -8.638 6.055 -1.427 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176.5 on 49 degrees of freedom
## Multiple R-squared: 0.03988, Adjusted R-squared: 0.02029
## F-statistic: 2.035 on 1 and 49 DF, p-value: 0.16
plot(sumdata$insurance_premiums ~ sumdata$fatal_accident)
abline(m1)
cor( sumdata$insurance_premiums, sumdata$fatal_accident)
## [1] -0.1997019
If the number of drivers involved in fatal collisions per billion miles increases by 1, the predicted insurance premium goes down by $8.64, which is surprising. However, the slope is not statistically significant (p = 0.16), and only 3.99% of the variance in the response variable (insurance_premiums) is explained by the explanatory variable (fatal_accident). The correlation of -0.20 indicates a very weak negative linear relationship between the two variables.
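As a rough check on how uncertain this slope is, a 95% confidence interval can be computed from the estimate and standard error printed in the summary(m1) output. This is a minimal sketch using only those printed (rounded) values; with the fitted model available, `confint(m1)` should give the same interval without hand calculation.

```r
# Values copied from the summary(m1) output above (rounded as printed)
estimate  <- -8.638   # slope for fatal_accident
std_error <- 6.055    # standard error of the slope
df_resid  <- 49       # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- critical value * standard error
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 2))
```

The interval runs from roughly -20.8 to 3.5 and includes zero, which is consistent with the p-value of 0.16: the data do not rule out a slope of zero.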
m2 <- lm(insurance_premiums ~ losses, data = sumdata)
summary(m2)
##
## Call:
## lm(formula = insurance_premiums ~ losses, data = sumdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.33 -96.75 -40.11 112.24 379.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 285.3251 109.6689 2.602 0.0122 *
## losses 4.4733 0.8021 5.577 1.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140.9 on 49 degrees of freedom
## Multiple R-squared: 0.3883, Adjusted R-squared: 0.3758
## F-statistic: 31.1 on 1 and 49 DF, p-value: 1.043e-06
plot(sumdata$insurance_premiums ~ sumdata$losses)
abline(m2)
cor( sumdata$insurance_premiums, sumdata$losses)
## [1] 0.6231164
For every dollar increase in losses incurred by insurance companies, the predicted insurance premium goes up by about $4.47. Roughly 38.83% of the variance in the response variable (insurance_premiums) is explained by this predictor (losses), and the slope is statistically significant (p = 1.04e-06). The correlation of 0.62 indicates a moderate positive linear relationship between the two variables.
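For a regression with a single predictor, the R-squared value is the square of the correlation coefficient, which ties the cor() results to the "Multiple R-squared" lines in the two summaries. A short check using the correlations printed above:

```r
# Correlations copied from the cor() output above
r_fatal  <- -0.1997019
r_losses <-  0.6231164

# With one predictor, R-squared equals the squared correlation
r2 <- c(fatal_accident = r_fatal^2, losses = r_losses^2)
print(round(r2, 4))
```

The squared values agree with the reported R-squared of 0.03988 for fatal_accident and 0.3883 for losses.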
Research question
Now, to answer our research question, we will try to predict the insurance premium of the three states with the highest, lowest, and median average insurance premiums, using the two chosen variables: fatal_accident (number of drivers involved in fatal collisions per billion miles) and losses (losses incurred by insurance companies for collisions per insured driver). These states are New Jersey (highest), Idaho (lowest), and South Carolina (median).
To predict the insurance premiums, we will use the least-squares regression line:
\[ \hat{y} = \beta_0 + \beta_1x \]
where
\(\beta_1\) = the slope of the regression line
\(\beta_0\) = the intercept, that is, the value of \(\hat{y}\) where the regression line crosses the y axis (\(x = 0\)).
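The equation can be applied by hand from the fitted coefficients or through R's `predict()` function; both give the same value. The sketch below shows this on hypothetical toy data that lies exactly on the line y = 2 + 3x, so the result is easy to verify. The names `toy`, `fit`, and `x_new` are introduced only for this illustration.

```r
# Hypothetical toy data lying exactly on y = 2 + 3x
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 + 3 * toy$x

fit <- lm(y ~ x, data = toy)

# Extract the intercept (beta_0) and slope (beta_1)
b0 <- unname(coef(fit)["(Intercept)"])
b1 <- unname(coef(fit)["x"])

# Prediction by hand: y-hat = beta_0 + beta_1 * x
x_new  <- 10
manual <- b0 + b1 * x_new

# Same prediction via predict(); newdata must use the predictor's column name
via_predict <- unname(predict(fit, newdata = data.frame(x = x_new)))

print(c(manual = manual, via_predict = via_predict))
```

For the models in this report, the same pattern would be `predict(m1, newdata = data.frame(fatal_accident = ...))`, which avoids rounding the coefficients before multiplying.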
Estimate the New Jersey average insurance premium from the number of drivers involved in fatal collisions per billion miles:
\[ \widehat{\text{premium}}_{\text{New Jersey}} = 926.582 \]
Our model underestimates the insurance premium by 374.938.
Estimate the Idaho average insurance premium from the number of drivers involved in fatal collisions per billion miles:
\[ \widehat{\text{premium}}_{\text{Idaho}} = 891.158 \]
Our model overestimates the insurance premium by 249.198.
Estimate the South Carolina average insurance premium from the number of drivers involved in fatal collisions per billion miles:
\[ \widehat{\text{premium}}_{\text{South Carolina}} = 816.854 \]
Our model underestimates the insurance premium by 42.116 (the actual premium of 858.97 minus the estimate of 816.854).
Estimate the New Jersey average insurance premium from the losses incurred by insurance companies for collisions per insured driver ($):
\[ \widehat{\text{premium}}_{\text{New Jersey}} = 1000.33405 \]
Our model underestimates the insurance premium by 301.18595.
Estimate the Idaho average insurance premium from the losses incurred by insurance companies for collisions per insured driver ($):
\[ \widehat{\text{premium}}_{\text{Idaho}} = 655.46575 \]
Our model overestimates the insurance premium by 13.50575.
Estimate the South Carolina average insurance premium from the losses incurred by insurance companies for collisions per insured driver ($):
\[ \widehat{\text{premium}}_{\text{South Carolina}} = 805.49017 \]
Our model underestimates the insurance premium by 53.47983.
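The six estimates can be reproduced in one table. The sketch below rests on two assumptions that should be checked against sumdata: the predictor and premium values for the three states are taken to be those in bad-drivers.csv, and the coefficients are the rounded values that the hand calculations appear to use (1023.35 and -8.64 for fatal_accident, 285.325 and 4.473 for losses). A positive residual means the model underestimates the premium; a negative one means it overestimates.

```r
# Assumed values for the three states (verify against sumdata)
states <- data.frame(
  state              = c("New Jersey", "Idaho", "South Carolina"),
  fatal_accident     = c(11.2, 15.3, 23.9),
  losses             = c(159.85, 82.75, 116.29),
  insurance_premiums = c(1301.52, 641.96, 858.97)
)

# Rounded coefficients (assumed to match the hand calculations)
states$pred_fatal  <- 1023.35 - 8.64 * states$fatal_accident
states$pred_losses <- 285.325 + 4.473 * states$losses

# Residual = actual - predicted
states$resid_fatal  <- states$insurance_premiums - states$pred_fatal
states$resid_losses <- states$insurance_premiums - states$pred_losses

print(states[, c("state", "pred_fatal", "resid_fatal",
                 "pred_losses", "resid_losses")])
```

Using `predict()` on m1 and m2 with the unrounded coefficients would shift these numbers slightly.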
Conditions: The conditions for inference are taken to be satisfied, since the sample size (51) is larger than 30, the premiums are assumed to follow an approximately normal distribution, and the observations are independent.
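For a regression model, the normality condition is usually checked on the residuals rather than on the raw response. The sketch below shows one way to do that, using simulated data as a hypothetical stand-in (the numbers are not from the bad-drivers data); for this report, `fit` would be replaced by m1 or m2.

```r
# Simulated stand-in data (hypothetical, for illustration only)
set.seed(1)
toy <- data.frame(x = runif(51, min = 5, max = 25))
toy$y <- 900 + 2 * toy$x + rnorm(51, sd = 150)

fit <- lm(y ~ x, data = toy)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw$p.value)

# In an interactive session, these plots help judge the conditions visually:
#   plot(fitted(fit), res)    # look for constant spread and no curvature
#   qqnorm(res); qqline(res)  # points near the line suggest normal residuals
```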
H0: There is no association between car insurance premiums and either losses or fatal_accident.
HA: Car insurance premiums are associated with at least one of losses or fatal_accident.
# fatal_accident is the response here; with one predictor, F and p are the same either way
infer <- aov(sumdata$fatal_accident ~ sumdata$insurance_premiums)
summary(infer)
## Df Sum Sq Mean Sq F value Pr(>F)
## sumdata$insurance_premiums 1 33.9 33.88 2.035 0.16
## Residuals 49 815.7 16.65
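The F value and p-value in this table match those of the earlier regression of insurance_premiums on fatal_accident. With a single predictor, the F statistic depends only on the squared correlation and the sample size, so it can be recovered from the reported R-squared. A minimal sketch, using the rounded R-squared printed in the regression summary:

```r
# Values taken from the regression output (R-squared rounded as printed)
r2 <- 0.03988
n  <- 51

# F statistic for a one-predictor regression: r^2 * (n - 2) / (1 - r^2)
f_stat <- r2 * (n - 2) / (1 - r2)

# Upper-tail probability of the F distribution with 1 and n - 2 df
p_value <- pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 3))
```

Because the squared correlation is the same whichever variable is treated as the response, swapping the two sides of the formula changes the sums of squares but not the F value or the p-value.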
Furthermore, we run a linear regression model. The intercept of 1023.354 is the predicted premium when fatal_accident is zero, and the estimated coefficient for fatal_accident is negative (-8.638) but not statistically significant (p = 0.16).
reg<-lm(formula = insurance_premiums ~ fatal_accident , data = sumdata)
summary(reg)
##
## Call:
## lm(formula = insurance_premiums ~ fatal_accident, data = sumdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249.23 -136.43 -22.29 133.45 435.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1023.354 98.748 10.363 6.08e-14 ***
## fatal_accident -8.638 6.055 -1.427 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176.5 on 49 degrees of freedom
## Multiple R-squared: 0.03988, Adjusted R-squared: 0.02029
## F-statistic: 2.035 on 1 and 49 DF, p-value: 0.16
In this project, I tried to estimate the average insurance premium by state from two variables in my initial data set using linear regression models, and compared the estimates with the actual values. One of my variables, “losses incurred by insurance companies for collisions per insured driver ($)”, came closest to the actual values, but the differences are still too large.
For future research, I would like to analyze other factors, such as the driver’s age and driving record, to see how they affect premiums.