# DO NOT REMOVE
# THIS IS FOR SETTING SOME PLOTTING PARAMETERS SO THAT YOUR PLOTS DON'T TAKE UP TOO MUCH SPACE
# IF YOU WOULD LIKE TO CHANGE THESE, SEE HELP FILES ON THE par() FUNCTION
# OR ASK FOR HELP
library(knitr)
## set global chunk options
opts_chunk$set(fig.path='figure/manual-', cache.path='cache/manual-', fig.align='center', fig.show='hold', par=TRUE)
## tune details of base graphics (http://yihui.name/knitr/hooks)
knit_hooks$set(par=function(before, options, envir){
if (before && options$fig.show!='none') par(mar=c(4,4,.2,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3)
})
# load data
data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv", header = TRUE, stringsAsFactors = FALSE)
library(dplyr)
library(DT)
library(tidyr)
library(ggplot2)
library(plotly)

Part 1 - Introduction:

Cars are important part of modern life. While for some cars are a mean to get to A to B, for others cars are a necessity to conduct their day to day business. Driving car has become such an integral part of our lives that sometimes we forget the dangers that are associated with it. Accidents can happen even to the most careful among us and when accidents do happen car insurance is the only thing that can save us from a lot of personal liabilities which we are otherwise accountable for.

We know car Insurance premium varies from driver to driver and there are lots of variables at work behind that, for example: age of the driver, driving record etc. We also pay higher or lower insurance depending on which state we live in. Can we predict the average insurance premiums for states from the driving record of that state.

In this project we will look at some car insurance related data from each of the 50 states and D.C., pick two variables from the data and try to predict average car insurance premium using those two variables and compare them with actual data. (Note: Research question was changed as instructed by instructor of the course, Dr. Jason Bryer)

Part 2 - Data:

Data Collection

Data was collected from, https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv

The dataset contains data on number of drivers involved in fatal collisions per billion miles, diferent conditions under those collisions happend, car insurance premium in U.S. dollars and losses incurred by insurance companies for collisions per insured driver in U.S. dollars for all 50 states of United States of America and District of Columbia.

The source of each variables can be found in the following table:

Variable <- c("Number of drivers involved in fatal collisions per billion miles", "Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding", "Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired",
"Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted",
"Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents", "Car Insurance Premiums ($)", "Losses incurred by insurance companies for collisions per insured driver ($)")

Source <- c("National Highway Traffic Safety Administration, 2012", "National Highway Traffic Safety Administration, 2009", "National Highway Traffic Safety Administration, 2012", "National Highway Traffic Safety Administration, 2012", "National Highway Traffic Safety Administration, 2012", "National Association of Insurance Commissioners, 2011", "National Association of Insurance Commissioners, 2010")

data_source <- data.frame(Variable, Source)

knitr::kable(data_source)
Variable Source
Number of drivers involved in fatal collisions per billion miles National Highway Traffic Safety Administration, 2012
Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding National Highway Traffic Safety Administration, 2009
Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired National Highway Traffic Safety Administration, 2012
Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted National Highway Traffic Safety Administration, 2012
Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents National Highway Traffic Safety Administration, 2012
Car Insurance Premiums ($) National Association of Insurance Commissioners, 2011
Losses incurred by insurance companies for collisions per insured driver ($) National Association of Insurance Commissioners, 2010
Cases:

There are 51 cases, 1 for each of 50 states plus 1 for District of Columbia with 8 variables.

Variables:

Our response variable for this project is Car Insurance Premiums ($) and the two chosen explanatory variables are:

  • Number of drivers involved in fatal collisions per billion miles.

  • Losses incurred by insurance companies for collisions per insured driver ($).

Type of Study:

The study is observational, the data is collected based on what is seen whithout any interference.

Scope of inference - generalizability:

The data represents entire population of U.S. drivers. Therefore, the data data is generalizable to the entire U.S. population.

Scope of inference - causality:

The data on collision includeds that involves fatality only, where as the losses incurred by the insurance companies per insured drivers represents all type of collisions not just fatal collisions. Also, the differnce in insurance premimum state by state will be influenced by other factors such as vehicular theft, natural disaster etc.

Part 3 - Exploratory data analysis:

To start our analysis, I have removed the unnecessary columns from the original data and renamed the column for ease of analysis. We have one table with four columns, below are the names of the columns and what they represent:

  • state: names of the state that the data reprensents.

  • fatal_accident: Number of drivers involved in fatal collisions per billion miles.

  • losses: Losses incurred by insurance companies for collisions per insured driver ($).

  • insurance_premiums: Average car insurance premiums for the state ($)

##Retrive names of columns
names(data)
##Select only required columns
sumdata <- data.frame(data$State, data$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles,  data$Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...., data$Car.Insurance.Premiums....)

Final Data

##Rename required columns
names(sumdata) <- c("state", "fatal_accident", "losses", "insurance_premiums")
datatable(sumdata, options = list(pageLength = 5))

Summary data

summary(sumdata)
##         state    fatal_accident      losses       insurance_premiums
##  Alabama   : 1   Min.   : 5.90   Min.   : 82.75   Min.   : 642.0    
##  Alaska    : 1   1st Qu.:12.75   1st Qu.:114.64   1st Qu.: 768.4    
##  Arizona   : 1   Median :15.60   Median :136.05   Median : 859.0    
##  Arkansas  : 1   Mean   :15.79   Mean   :134.49   Mean   : 887.0    
##  California: 1   3rd Qu.:18.50   3rd Qu.:151.87   3rd Qu.:1007.9    
##  Colorado  : 1   Max.   :23.90   Max.   :194.78   Max.   :1301.5    
##  (Other)   :45

Bar plot representing the data

plot1 <- ggplot(data = (gather(sumdata,"variable", "value", 2:4)), aes(x = state, y = value, fill = variable))+ geom_bar(stat="identity", position="dodge") + ggtitle("Fatal Collision, Losses and Premiums") + ylab("Number of Drivers Involved in Fatal collision Per Billion Mile/
Losses Iincurred by Insurance Companies for Collisions Per Insured Driver($)/
Insurance Premiums ($)") +  facet_wrap(~ variable) + coord_flip()

ggplotly(plot1)

Preparing data

m1 <- lm(insurance_premiums ~ fatal_accident, data = sumdata)
summary(m1)
## 
## Call:
## lm(formula = insurance_premiums ~ fatal_accident, data = sumdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -249.23 -136.43  -22.29  133.45  435.28 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1023.354     98.748  10.363 6.08e-14 ***
## fatal_accident   -8.638      6.055  -1.427     0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 176.5 on 49 degrees of freedom
## Multiple R-squared:  0.03988,    Adjusted R-squared:  0.02029 
## F-statistic: 2.035 on 1 and 49 DF,  p-value: 0.16
plot(sumdata$insurance_premiums ~ sumdata$fatal_accident)
abline(m1)
cor( sumdata$insurance_premiums, sumdata$fatal_accident)
## [1] -0.1997019

If the number of drivers involved in fatal collisions per billion miles increases by 1 the insurance premium goes down by $8.64, which is surprising. Only 3.99% of the variance found in the response variable (insurance_premiums) can be explained by the explanatory variable (fatal_accident). There is a very weak negative linear relationship between the two variables.

m2 <- lm(insurance_premiums ~ losses, data = sumdata)
summary(m2)
## 
## Call:
## lm(formula = insurance_premiums ~ losses, data = sumdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -213.33  -96.75  -40.11  112.24  379.97 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 285.3251   109.6689   2.602   0.0122 *  
## losses        4.4733     0.8021   5.577 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 140.9 on 49 degrees of freedom
## Multiple R-squared:  0.3883, Adjusted R-squared:  0.3758 
## F-statistic:  31.1 on 1 and 49 DF,  p-value: 1.043e-06
plot(sumdata$insurance_premiums ~ sumdata$losses)
abline(m2)
cor( sumdata$insurance_premiums, sumdata$losses)
## [1] 0.6231164

Every dollar increase in losses incurred by the insurance companies the insurance premimum goes up by $4.4733, roughly 38.83% of the of the variance found in the response variable (insurance_premiums) can be explained by this predictor variable (losses). There is a moderate positive linear relationship between the two variables

Research question

Now, to answer our research question we will try to predict the insurance premium of three states with highest, lowest and median insurance premium by using the two chosen variables for the states; fatal_accident (number of drivers involved in fatal collisions per billion miles) and losses (losses incurred by insurance companies for collisions per insured driver).

For the project I will analyze the data from three stats with maximum, minimum and median average insurance premimums.

State with maximum insurance premium

sumdata %>% 
      filter(insurance_premiums==max(insurance_premiums))
##        state fatal_accident losses insurance_premiums
## 1 New Jersey           11.2 159.85            1301.52

State with minimum insurance premium

sumdata %>% 
      filter(insurance_premiums==min(insurance_premiums))
##   state fatal_accident losses insurance_premiums
## 1 Idaho           15.3  82.75             641.96

State with median insurance premium

sumdata %>% 
      filter(insurance_premiums==median(insurance_premiums))
##            state fatal_accident losses insurance_premiums
## 1 South Carolina           23.9 116.29             858.97

To predict the insurance premiums we will use the Least Square Regression Line Equation:

\[ \hat{y} = \beta_0 + \beta_1x \]

Where,

\(\beta_1\) = The slope of the regression line

\(\beta_0\) = The intercept point of the regression line and the y axis.

Estimate New Jersey average insurance premimum by looking at number of drivers involved in fatal collisions per billion miles:

\[ \hat{New Jersey} = 926.582 \]

Our model under estimates the insurance premimum by 374.938

Estimate Idaho average insurance premimum by looking at number of drivers involved in fatal collisions per billion miles:

\[ \hat{Idaho} = 891.158 \]

Our model over estimates the insurance premimum by 249.198

Estimate South Carolina average insurance premimum by looking at number of drivers involved in fatal collisions per billion miles:

\[ \hat{South Carolina} = 816.854 \]

Our model under estimates the insurance premimum by -67.612

Estimate New Jersey average insurance premimum by looking at losses incurred by insurance companies for collisions per insured driver ($):

\[ \hat{New Jersey} = 1000.33405 \]

Our model under estimates the insurance premimum by 301.18595

Estimate Idaho average insurance premimum by looking at losses incurred by insurance companies for collisions per insured driver ($):

\[ \hat{Idaho} = 655.46575 \]

Our model over estimates the insurance premimum by 13.50575

Estimate South Carolina average insurance premimum by looking at number of drivers involved in fatal collisions per billion miles:

\[ \hat{South Carolina} = 805.49017 \]

Our model under estimates the insurance premimum by 53.47983

Part 4 - Inference:

For my project I have used linear regression model to preditct insurance premiums from two variables. But is it appropriate to use the linear regression model here? To answer that I will conduct a model diagnostic to satisfy the following conditons:

Linearity: The data should show a linear trend

Nearly Normal Residuals: Generally the residuals must be nearly normal.

Constant Variability: The variability of points around the least squares line remains roughly constant.

Independent Observations: The observations of the data set must be independent.

Diagnose fatal_accident model

par(mfrow=c(2,2))
plot(sumdata$fatal_accident, sumdata$insurance_premiums)
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
plot(sumdata$fatal_accident, m1$residuals)
abline(h = 0, lty = 3)

From the scatter plot above we can see there is low but a negative linear trend with some influencial outliers. The Q-Q plot and histogram indicats nearly normal residual and the residual plot shows constant variability. Also we have strong evidence of the independence of the data.

Diagnose losses model

par(mfrow=c(2,2))
plot(sumdata$losses, sumdata$insurance_premiums)
hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
plot(sumdata$losses, m2$residuals)
abline(h = 0, lty = 3)

From the scatter plot above we can see there is moderate positive linear trend. The Q-Q plot and histogram indicats nearly normal residual and the residual plot shows constant variability. Also we have strong evidence of the independence of the data.

We can conclude both our models satisfy conditions of linear regression model.

Part 5 - Conclusion:

In my project I have tried to estimate average insurance premimum by states from two variables from my initial data set using linear regression model and compare it with actual value. One of my variable, “losses incurred by insurance companies for collisions per insured driver ($)” came closest to actual value but still the difference is too high. There are lots of other varibles that contributes to this difference in average insurance premimums which are out of scope for this project.