Data Preparation

# load data
drivers <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")

head(drivers)

names(drivers) <- c("State", "Collisions", "Perc.Speeding", "Perc.Alcohol", "Perc.Not.Distracted", "Perc.No.Pre.Accident", "Insurance.Premium", "Insurance.Loss")
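Assigning names by position will silently mislabel columns if the source file ever changes shape. One possible guard is sketched below. It assumes a hypothetical helper, `rename_drivers()`, which is not part of the original analysis, and uses a small invented data frame in place of the downloaded file.

```r
# Hypothetical helper: rename columns only if the column count matches.
rename_drivers <- function(df) {
  new_names <- c("State", "Collisions", "Perc.Speeding", "Perc.Alcohol",
                 "Perc.Not.Distracted", "Perc.No.Pre.Accident",
                 "Insurance.Premium", "Insurance.Loss")
  # Stop early rather than mislabel columns if the file layout changed.
  stopifnot(ncol(df) == length(new_names))
  names(df) <- new_names
  df
}

# Toy stand-in for the real file (values are invented for illustration only).
toy <- as.data.frame(matrix(1:16, nrow = 2))
toy_renamed <- rename_drivers(toy)
names(toy_renamed)
```

With the real data, the equivalent call would be `drivers <- rename_drivers(drivers)`.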

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Are losses incurred by insurance companies predictive of higher car insurance premiums?

Cases

What are the cases, and how many are there?

Each case represents a state in the United States. There are 51 observations in the data set (the 50 states plus the District of Columbia).
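The case count can be checked directly rather than read off the table. The following is a minimal sketch, assuming a hypothetical helper `count_cases()` and a three-row toy data frame standing in for the real `drivers` object.

```r
# Hypothetical helper: number of cases, plus a check that each id appears once.
count_cases <- function(df, id_col = "State") {
  ids <- df[[id_col]]
  list(n_cases = nrow(df), one_row_per_id = !any(duplicated(ids)))
}

# Toy stand-in for `drivers` (invented rows, for illustration only).
toy <- data.frame(State = c("Alabama", "Alaska", "District of Columbia"),
                  Insurance.Premium = c(1, 2, 3))
count_cases(toy)
```

Applied to the real data, `n_cases` should equal 51 if the description above holds.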

Data collection

Describe the method of data collection.

The data were compiled by FiveThirtyEight from the National Highway Traffic Safety Administration and the National Association of Insurance Commissioners.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data were compiled by FiveThirtyEight and are available online at https://github.com/fivethirtyeight/data/tree/master/bad-drivers. For this project, the data were read from the raw CSV file on GitHub.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the car insurance premium (Insurance.Premium), and it is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The two independent variables are State (qualitative) and Insurance Loss (quantitative).

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

library(DT)
library(psych)
library(ggplot2)
library(dplyr)
datatable(drivers)
describe(drivers$Insurance.Premium)
##    vars  n   mean    sd median trimmed    mad    min     max  range skew
## X1    1 51 886.96 178.3 858.97  870.51 187.83 641.96 1301.52 659.56 0.73
##    kurtosis    se
## X1    -0.43 24.97
describe(drivers$Insurance.Loss)
##    vars  n   mean    sd median trimmed  mad   min    max  range skew
## X1    1 51 134.49 24.84 136.05  134.01 26.2 82.75 194.78 112.03 0.15
##    kurtosis   se
## X1    -0.23 3.48
summary(drivers)
##         State      Collisions    Perc.Speeding    Perc.Alcohol  
##  Alabama   : 1   Min.   : 5.90   Min.   :13.00   Min.   :16.00  
##  Alaska    : 1   1st Qu.:12.75   1st Qu.:23.00   1st Qu.:28.00  
##  Arizona   : 1   Median :15.60   Median :34.00   Median :30.00  
##  Arkansas  : 1   Mean   :15.79   Mean   :31.73   Mean   :30.69  
##  California: 1   3rd Qu.:18.50   3rd Qu.:38.00   3rd Qu.:33.00  
##  Colorado  : 1   Max.   :23.90   Max.   :54.00   Max.   :44.00  
##  (Other)   :45                                                  
##  Perc.Not.Distracted Perc.No.Pre.Accident Insurance.Premium
##  Min.   : 10.00      Min.   : 76.00       Min.   : 642.0   
##  1st Qu.: 83.00      1st Qu.: 83.50       1st Qu.: 768.4   
##  Median : 88.00      Median : 88.00       Median : 859.0   
##  Mean   : 85.92      Mean   : 88.73       Mean   : 887.0   
##  3rd Qu.: 95.00      3rd Qu.: 95.00       3rd Qu.:1007.9   
##  Max.   :100.00      Max.   :100.00       Max.   :1301.5   
##                                                            
##  Insurance.Loss  
##  Min.   : 82.75  
##  1st Qu.:114.64  
##  Median :136.05  
##  Mean   :134.49  
##  3rd Qu.:151.87  
##  Max.   :194.78  
## 
ggplot(drivers, aes(x=Insurance.Loss)) + geom_histogram(binwidth = 10)

ggplot(drivers, aes(x=Insurance.Premium)) + geom_histogram(binwidth = 80)

drivers %>% ggplot(aes(x=Insurance.Loss, y=Insurance.Premium)) + geom_point()
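
To put a number on the relationship shown in the scatter plot, one option is the Pearson correlation together with a simple least-squares line. The sketch below is illustrative only and is not the original analysis. The helper `premium_vs_loss()` is hypothetical, and the toy data are invented so that the expected slope and intercept are known in advance.

```r
# Hypothetical helper: correlation and least-squares line for premium vs. loss.
premium_vs_loss <- function(df) {
  fit <- lm(Insurance.Premium ~ Insurance.Loss, data = df)
  list(r = cor(df$Insurance.Loss, df$Insurance.Premium),
       intercept = unname(coef(fit)[1]),
       slope = unname(coef(fit)[2]))
}

# Toy data lying exactly on premium = 100 + 5 * loss (invented values).
toy <- data.frame(Insurance.Loss = c(100, 120, 140, 160))
toy$Insurance.Premium <- 100 + 5 * toy$Insurance.Loss
premium_vs_loss(toy)
```

Calling `premium_vs_loss(drivers)` on the real data would return the same three quantities. Because the study is observational, a positive slope would indicate association, not causation.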