DATA 606 Final Project

library(dplyr)
library(DT)
library(tidyr)
library(ggplot2)
library(plotly)

Part 1 - Introduction

Which state has the worst drivers? Is there any correlation between the number of accidents and insurance premiums? I chose this topic because I paid a high auto insurance premium for more than two years, but after I requested a new quote this year my premium went down considerably. I was surprised, because as a new driver over the past two years I have had a couple of accidents and went to court several times. I would like to know how premiums relate to fatal accidents and to the losses that insurance companies pay out. I also found out that bad drivers in your state can affect your premium as well.

We know that car insurance premiums vary from driver to driver, and there are many variables behind them, such as the driver’s age and driving record. We also pay higher or lower premiums depending on the state in which we live, so a state’s overall driving record may help predict its average insurance premium.

In this project, I study auto insurance data from the 50 states and the District of Columbia, select two variables (fatal accidents and losses), and use them to predict the average auto insurance premium, which I then compare with the actual data.

Part 2 - Data

Data Collection

The data is collected from https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv

The data set contains the following for each of the 50 states and the District of Columbia: the number of drivers involved in fatal collisions per billion miles, the conditions at the time of these collisions, the average car insurance premium (USD), and the losses incurred by insurance companies for collisions per insured driver (USD).

library(knitr)
knitr::opts_chunk$set(echo = TRUE)

# load data
data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv", header = TRUE, stringsAsFactors = FALSE)
head(data)

Cases:

dim(data)

## [1] 51  8

There are 51 cases, one for each of the 50 states plus one for the District of Columbia, with 8 variables.

Variables:

Our response variable for this project is Car Insurance Premiums ($) and the two chosen explanatory variables are:

Number of drivers involved in fatal collisions per billion miles.
Losses incurred by insurance companies for collisions per insured driver ($).

Type of Study:

The study is observational: the data were collected as observed, without any intervention.

Scope of inference - generalizability:

The data represent the entire population of U.S. drivers, aggregated by state. Therefore, the results are generalizable to the entire U.S. population.

Scope of inference - causality:

The collision data cover only fatal collisions, whereas the losses incurred by insurance companies per insured driver cover all types of collisions, not just fatal ones. Also, differences in insurance premiums from state to state are influenced by other factors, such as vehicle theft and natural disasters. Because the study is observational, these data cannot establish causality.

Part 3 - Exploratory data analysis

To start the analysis, we removed unnecessary columns from the original data and renamed the remaining ones to facilitate analysis. The resulting table has four columns, with the following names and meanings:

state: The name of the state represented by the data.
fatal_accident: The number of drivers involved in fatal collisions per billion miles.
losses: The losses ($) incurred by insurance companies for collisions per insured driver.
insurance_premiums: The average car insurance premium in the state ($).

##Retrieve names of columns
names(data)
##Select only required columns
sumdata <- data.frame(data$State, data$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles,  data$Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...., data$Car.Insurance.Premiums....)

Final Data

##Rename required columns
names(sumdata) <- c("state", "fatal_accident", "losses", "insurance_premiums")
datatable(sumdata, options = list(pageLength = 5))

Summary data

summary(sumdata)

##     state           fatal_accident      losses       insurance_premiums
##  Length:51          Min.   : 5.90   Min.   : 82.75   Min.   : 642.0    
##  Class :character   1st Qu.:12.75   1st Qu.:114.64   1st Qu.: 768.4    
##  Mode  :character   Median :15.60   Median :136.05   Median : 859.0    
##                     Mean   :15.79   Mean   :134.49   Mean   : 887.0    
##                     3rd Qu.:18.50   3rd Qu.:151.87   3rd Qu.:1007.9    
##                     Max.   :23.90   Max.   :194.78   Max.   :1301.5

Visual representation of data

Number of fatal collisions in each state

ggplot(sumdata,aes(x=reorder(state,fatal_accident), y=fatal_accident) )+
  geom_bar(stat = "identity")+
  xlab("States")+
  ylab("Number of Collisions")+
  ggtitle("Number of fatal collisions in each state")+
  theme(axis.text.x=element_text(angle=90,hjust=0.2,vjust=0.2))

We can see that South Carolina has the highest number of such drivers: 23.9 drivers were involved in fatal collisions for every billion miles traveled, well above the median of 15.6 across states (mean 15.79).

Average Insurance Premiums

ggplot(sumdata,aes(x=reorder(state,insurance_premiums), y= insurance_premiums) )+
  geom_bar(stat = "identity")+
  coord_flip()+
  xlab("States")+
  ylab("Premiums")+
  ggtitle("Average Insurance Premiums by State")

State with maximum insurance premium

sumdata %>% 
      filter(insurance_premiums==max(insurance_premiums))

State with minimum insurance premium

sumdata %>% 
      filter(insurance_premiums==min(insurance_premiums))

State with median insurance premium

sumdata %>% 
      filter(insurance_premiums==median(insurance_premiums))

Based on insurance premiums, the highest-risk drivers are found in New Jersey, which at $1,301 has the most expensive car insurance in the country. The median state premium was $859 (mean $887).

How much insurance companies pay out

ggplot(sumdata,aes(x=reorder(state,losses), y=losses) )+
  geom_bar(stat = "identity")+
  xlab("States")+
  ylab("Amount")+
  ggtitle("Loss Incurred by Insurance Company")+
  theme(axis.text.x=element_text(angle=90,hjust=0.2,vjust=0.2))

The losses incurred by insurance companies for collisions per insured driver are another measure I use to determine which state has the worst drivers. Louisiana was the most expensive, at $194.78. By this measure, Idaho has the best drivers in the U.S. The mean payout across states was $134.49.

Linear model

m1 <- lm(insurance_premiums ~ fatal_accident, data = sumdata)
summary(m1)

## 
## Call:
## lm(formula = insurance_premiums ~ fatal_accident, data = sumdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -249.23 -136.43  -22.29  133.45  435.28 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1023.354     98.748  10.363 6.08e-14 ***
## fatal_accident   -8.638      6.055  -1.427     0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 176.5 on 49 degrees of freedom
## Multiple R-squared:  0.03988,    Adjusted R-squared:  0.02029 
## F-statistic: 2.035 on 1 and 49 DF,  p-value: 0.16

plot(sumdata$insurance_premiums ~ sumdata$fatal_accident)
abline(m1)

cor( sumdata$insurance_premiums, sumdata$fatal_accident)

## [1] -0.1997019

For each additional driver involved in fatal collisions per billion miles, the predicted insurance premium goes down by $8.64, which is surprising. However, the slope is not statistically significant (p = 0.16), and only 3.99% of the variance in the response variable (insurance_premiums) is explained by the explanatory variable (fatal_accident). There is a very weak negative linear relationship between the two variables.
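
As a quick check on the regression output above, the t value for fatal_accident is the estimate divided by its standard error, and in a model with one predictor its square equals the F-statistic. The numbers below are copied from the summary(m1) output, so small differences are due to rounding:

\[ t = \frac{-8.638}{6.055} \approx -1.427, \qquad F = t^2 = \left(\frac{-8.638}{6.055}\right)^2 \approx 2.035 \]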

m2 <- lm(insurance_premiums ~ losses, data = sumdata)
summary(m2)

## 
## Call:
## lm(formula = insurance_premiums ~ losses, data = sumdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -213.33  -96.75  -40.11  112.24  379.97 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 285.3251   109.6689   2.602   0.0122 *  
## losses        4.4733     0.8021   5.577 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 140.9 on 49 degrees of freedom
## Multiple R-squared:  0.3883, Adjusted R-squared:  0.3758 
## F-statistic:  31.1 on 1 and 49 DF,  p-value: 1.043e-06

plot(sumdata$insurance_premiums ~ sumdata$losses)
abline(m2)

cor( sumdata$insurance_premiums, sumdata$losses)

## [1] 0.6231164

For every dollar increase in losses incurred by insurance companies, the predicted insurance premium goes up by $4.47. Roughly 38.83% of the variance in the response variable (insurance_premiums) is explained by this predictor (losses). There is a moderate positive linear relationship between the two variables.
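
In a simple linear regression, the multiple R-squared is the square of the correlation coefficient. As a sanity check, squaring the two correlations reported by cor() reproduces the R-squared values in the model summaries (the subscripts m1 and m2 refer to the two fitted models):

\[ R^2_{m1} = (-0.1997019)^2 \approx 0.0399, \qquad R^2_{m2} = (0.6231164)^2 \approx 0.3883 \]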

Research question

Now, to answer our research question, we will try to predict the insurance premiums of the three states with the highest, lowest, and median average insurance premiums, using each of the two chosen variables: fatal_accident (number of drivers involved in fatal collisions per billion miles) and losses (losses incurred by insurance companies for collisions per insured driver).

To predict the insurance premiums we will use the least squares regression line:

\[ \hat{y} = \beta_0 + \beta_1x \]

Where,

$\beta_1$ = The slope of the regression line

$\beta_0$ = The intercept of the regression line with the y axis.
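
The equation above can be applied directly once the coefficients are known. The following is a minimal sketch, not the exact calculation used for the estimates in this report: predict_premium is a hypothetical helper, the coefficients are the rounded values printed by summary(m1) and summary(m2), and the inputs are the median fatal_accident and losses values from summary(sumdata), chosen only for illustration.

```r
# Evaluate the fitted line y_hat = b0 + b1 * x for a given predictor value.
# predict_premium is a hypothetical helper written for this illustration.
predict_premium <- function(b0, b1, x) {
  b0 + b1 * x
}

# Rounded coefficients copied from the summary(m1) and summary(m2) output.
m1_coef <- c(b0 = 1023.354, b1 = -8.638)  # insurance_premiums ~ fatal_accident
m2_coef <- c(b0 = 285.3251, b1 = 4.4733)  # insurance_premiums ~ losses

# Illustrative inputs: the medians reported by summary(sumdata).
pred_from_fatal  <- predict_premium(m1_coef[["b0"]], m1_coef[["b1"]], 15.60)
pred_from_losses <- predict_premium(m2_coef[["b0"]], m2_coef[["b1"]], 136.05)

print(c(fatal_accident = pred_from_fatal, losses = pred_from_losses))
```

With the fitted model objects available, predict(m1, newdata = data.frame(fatal_accident = 15.6)) gives the same kind of estimate using the unrounded coefficients.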

Estimate of the New Jersey average insurance premium from the number of drivers involved in fatal collisions per billion miles:

\[ \hat{y}_{\text{New Jersey}} = 926.582 \]

Our model underestimates the insurance premium by 374.938.

Estimate of the Idaho average insurance premium from the number of drivers involved in fatal collisions per billion miles:

\[ \hat{y}_{\text{Idaho}} = 891.158 \]

Our model overestimates the insurance premium by 249.198.

Estimate of the South Carolina average insurance premium from the number of drivers involved in fatal collisions per billion miles:

\[ \hat{y}_{\text{South Carolina}} = 816.854 \]

Our model underestimates the insurance premium by 42.116 (the actual premium is 858.97).

Estimate of the New Jersey average insurance premium from the losses incurred by insurance companies for collisions per insured driver ($):

\[ \hat{y}_{\text{New Jersey}} = 1000.33405 \]

Our model underestimates the insurance premium by 301.18595.

Estimate of the Idaho average insurance premium from the losses incurred by insurance companies for collisions per insured driver ($):

\[ \hat{y}_{\text{Idaho}} = 655.46575 \]

Our model overestimates the insurance premium by 13.50575.

Estimate of the South Carolina average insurance premium from the losses incurred by insurance companies for collisions per insured driver ($):

\[ \hat{y}_{\text{South Carolina}} = 805.49017 \]

Our model underestimates the insurance premium by 53.47983.
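
To compare the two models side by side, the estimates above can be collected into one table of residuals. This is a sketch under one stated assumption: the actual premiums are back-calculated from the predictions and errors reported in this section (for example, 926.582 + 374.938 = 1301.52) rather than read from the data frame. The name estimates and the mean absolute error summary are introduced here for illustration only.

```r
# Predicted premiums copied from the estimates in this section.
# Actual premiums are back-calculated from the reported predictions and
# errors (an assumption of this sketch), e.g. 926.582 + 374.938 = 1301.52.
estimates <- data.frame(
  state       = c("New Jersey", "Idaho", "South Carolina"),
  actual      = c(1301.52, 641.96, 858.97),
  pred_fatal  = c(926.582, 891.158, 816.854),
  pred_losses = c(1000.33405, 655.46575, 805.49017)
)

# Residual = actual - predicted; a positive value means the model underestimates.
estimates$resid_fatal  <- estimates$actual - estimates$pred_fatal
estimates$resid_losses <- estimates$actual - estimates$pred_losses

# Mean absolute error of each model over these three states.
mae <- c(
  fatal_accident = mean(abs(estimates$resid_fatal)),
  losses         = mean(abs(estimates$resid_losses))
)

print(estimates)
print(mae)
```

With the fitted models in memory, resid(m1) and resid(m2) return the same residuals for all 51 cases.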

Part 4 - Inference

Conditions: We treat the conditions for inference as satisfied: the sample size is larger than 30, the premiums are assumed to follow an approximately normal distribution, and the observations are assumed to be independent.
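
For regression inference, the conditions are usually checked on the residuals of the fitted model (roughly normal, centered on zero, with constant spread). The sketch below shows one way this could be done. check_lm_conditions is a hypothetical helper, not part of any package, and the demonstration uses R's built-in cars data as a stand-in so that the block runs on its own; in this report m1 or m2 would be passed instead.

```r
# Summarise simple residual diagnostics for a fitted lm object.
# check_lm_conditions is a hypothetical helper, not part of any package.
check_lm_conditions <- function(model) {
  res <- resid(model)
  list(
    n             = length(res),
    mean_residual = mean(res),                 # close to 0 for least squares with an intercept
    shapiro_p     = shapiro.test(res)$p.value  # a small p-value suggests non-normal residuals
  )
}

# Stand-in model on the built-in cars data (not the insurance data).
demo_model <- lm(dist ~ speed, data = cars)
demo_check <- check_lm_conditions(demo_model)
print(demo_check)
```

Residual plots complement these numbers: plot(fitted(m1), resid(m1)) shows whether the spread is constant, and qqnorm(resid(m1)) followed by qqline(resid(m1)) shows how close the residuals are to normal.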

H0: There is no association between car insurance premiums and either losses or fatal_accident.
HA: Car insurance premiums are associated with at least one of losses or fatal_accident.
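
In terms of the regression line defined in Part 3, the hypotheses for a single predictor can be restated as a test on the slope, applied separately to fatal_accident and to losses:

\[ H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0 \]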

infer <- aov(sumdata$fatal_accident ~ sumdata$insurance_premiums)
summary(infer)

##                            Df Sum Sq Mean Sq F value Pr(>F)
## sumdata$insurance_premiums  1   33.9   33.88   2.035   0.16
## Residuals                  49  815.7   16.65

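Since the p-value (0.16) is greater than 0.05, we do not have sufficient evidence to reject the null hypothesis for fatal_accident: the data do not show a significant association between fatal_accident and insurance premiums. For losses, the regression in Part 3 gives a p-value of 1.04e-06, which supports an association between losses and premiums.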
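
The p-value in the ANOVA table can be recovered from the F value and its degrees of freedom, which makes the decision rule explicit. This is a minimal sketch using the numbers printed above; the 0.05 significance level is the conventional choice and is an assumption here.

```r
# Values copied from the ANOVA table above.
f_value <- 2.035
df1 <- 1
df2 <- 49
alpha <- 0.05  # conventional significance level (assumed)

# The upper-tail probability of the F distribution is the p-value.
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)

# Reject H0 only when the p-value falls below alpha.
reject_null <- p_value < alpha

print(round(p_value, 2))
print(reject_null)
```

Because the p-value is above alpha, reject_null is FALSE, meaning the null hypothesis is not rejected for fatal_accident.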

Furthermore, we ran a linear regression model. The intercept of 1023.354 is the predicted premium when fatal_accident is zero, and the estimated coefficient for fatal_accident is negative.

reg<-lm(formula = insurance_premiums ~ fatal_accident , data = sumdata)
summary(reg)

## 
## Call:
## lm(formula = insurance_premiums ~ fatal_accident, data = sumdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -249.23 -136.43  -22.29  133.45  435.28 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1023.354     98.748  10.363 6.08e-14 ***
## fatal_accident   -8.638      6.055  -1.427     0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 176.5 on 49 degrees of freedom
## Multiple R-squared:  0.03988,    Adjusted R-squared:  0.02029 
## F-statistic: 2.035 on 1 and 49 DF,  p-value: 0.16

Part 5 - Conclusion

In this project I tried to estimate the average insurance premium by state from two variables in my initial data set using linear regression models, and compared the estimates with the actual values. One of my variables, “losses incurred by insurance companies for collisions per insured driver ($)”, came closest to the actual values, but the differences are still large.

For future research, I would like to analyze other factors, such as the driver’s age and driving record, to see how they affect premiums.

References

Data explanation: http://fivethirtyeight.com/datalab/which-state-has-the-worst-drivers/
Data source: https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv