# load data
drivers <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")
head(drivers)
# replace the long original column headers with short names (order must match the file's columns)
names(drivers) <- c("State", "Collisions", "Perc.Speeding", "Perc.Alcohol",
                    "Perc.Not.Distracted", "Perc.No.Pre.Accident",
                    "Insurance.Premium", "Insurance.Loss")
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Across U.S. states, are higher losses incurred by insurance companies predictive of higher car insurance premiums?
What are the cases, and how many are there?
Each case represents a state in the United States. There are 51 observations in the data set (the 50 states plus the District of Columbia).
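The case count can be checked directly in R. The following is a minimal sketch, not part of the original analysis: `toy` is a hypothetical three-row stand-in with made-up values, and the spelling "District of Columbia" is an assumption about how the file labels that row. Swapping `toy` for `drivers` runs the same checks on the real data.

```r
# Hypothetical stand-in for `drivers`; the values are made up for illustration.
toy <- data.frame(
  State = c("Alabama", "Alaska", "District of Columbia"),
  Insurance.Premium = c(800, 1000, 1200),
  Insurance.Loss = c(140, 130, 135)
)

n_cases <- nrow(toy)                               # one row per case
one_row_per_state <- !any(duplicated(toy$State))   # TRUE if no state appears twice
has_dc <- "District of Columbia" %in% toy$State    # TRUE if DC is counted as a case

n_cases
one_row_per_state
has_dc
```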
Describe the method of data collection.
FiveThirtyEight compiled the data from the National Highway Traffic Safety Administration and the National Association of Insurance Commissioners.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The data was compiled by FiveThirtyEight and is available online here: https://github.com/fivethirtyeight/data/tree/master/bad-drivers. For this project, it was read directly from the raw CSV file on GitHub.
What is the response variable? Is it quantitative or qualitative?
The response variable is the car insurance premium (Insurance.Premium), and it is quantitative.
You should have two independent variables, one quantitative and one qualitative.
The two independent variables are State (qualitative) and Insurance Loss (quantitative).
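The variable types can be confirmed from the column classes. The sketch below is illustrative only: `toy` is a hypothetical stand-in with made-up values, and the helper labels "quantitative" and "qualitative" are introduced here for readability. It also counts how often each level of `State` occurs, which is once per level when there is one row per state.

```r
# Hypothetical stand-in for `drivers`; the values are made up for illustration.
toy <- data.frame(
  State = c("Alabama", "Alaska", "Arizona"),
  Insurance.Premium = c(800, 1000, 900),
  Insurance.Loss = c(140, 130, 110),
  stringsAsFactors = TRUE   # store State as a factor, as in the summary() output
)

# Label each column by whether R stores it as a number
var_type <- sapply(toy, function(col) {
  if (is.numeric(col)) "quantitative" else "qualitative"
})
var_type

# How many rows fall in each level of the qualitative variable
rows_per_state <- table(toy$State)
all(rows_per_state == 1)
```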
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
library(DT)
library(psych)
library(ggplot2)
library(dplyr)
datatable(drivers)
describe(drivers$Insurance.Premium)
##    vars  n   mean    sd median trimmed    mad    min     max  range skew kurtosis    se
## X1    1 51 886.96 178.3 858.97  870.51 187.83 641.96 1301.52 659.56 0.73    -0.43 24.97
describe(drivers$Insurance.Loss)
##    vars  n   mean    sd median trimmed  mad   min    max  range skew kurtosis   se
## X1    1 51 134.49 24.84 136.05  134.01 26.2 82.75 194.78 112.03 0.15    -0.23 3.48
summary(drivers)
##         State      Collisions    Perc.Speeding    Perc.Alcohol
##  Alabama   : 1   Min.   : 5.90   Min.   :13.00   Min.   :16.00
##  Alaska    : 1   1st Qu.:12.75   1st Qu.:23.00   1st Qu.:28.00
##  Arizona   : 1   Median :15.60   Median :34.00   Median :30.00
##  Arkansas  : 1   Mean   :15.79   Mean   :31.73   Mean   :30.69
##  California: 1   3rd Qu.:18.50   3rd Qu.:38.00   3rd Qu.:33.00
##  Colorado  : 1   Max.   :23.90   Max.   :54.00   Max.   :44.00
##  (Other)   :45
##  Perc.Not.Distracted Perc.No.Pre.Accident Insurance.Premium
##  Min.   : 10.00      Min.   : 76.00       Min.   : 642.0
##  1st Qu.: 83.00      1st Qu.: 83.50       1st Qu.: 768.4
##  Median : 88.00      Median : 88.00       Median : 859.0
##  Mean   : 85.92      Mean   : 88.73       Mean   : 887.0
##  3rd Qu.: 95.00      3rd Qu.: 95.00       3rd Qu.:1007.9
##  Max.   :100.00      Max.   :100.00       Max.   :1301.5
##
##  Insurance.Loss
##  Min.   : 82.75
##  1st Qu.:114.64
##  Median :136.05
##  Mean   :134.49
##  3rd Qu.:151.87
##  Max.   :194.78
##
ggplot(drivers, aes(x=Insurance.Loss)) + geom_histogram(binwidth = 10)
ggplot(drivers, aes(x=Insurance.Premium)) + geom_histogram(binwidth = 80)
drivers %>% ggplot(aes(x=Insurance.Loss, y=Insurance.Premium)) + geom_point()
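One way to put a number on the pattern in the scatter plot is a correlation coefficient and a simple linear regression of premium on loss. The sketch below is an assumption about how a later analysis step might look, not something taken from the original write-up: `toy` is a hypothetical four-row stand-in with made-up values, and only base R functions (`cor`, `lm`, `coef`) are used. Replacing `toy` with `drivers` would fit the same model to the real data.

```r
# Hypothetical stand-in for `drivers`; the values are made up for illustration.
toy <- data.frame(
  Insurance.Loss    = c(100, 120, 140, 160),
  Insurance.Premium = c(700, 820, 880, 1000)
)

# Pearson correlation between the predictor and the response
r <- cor(toy$Insurance.Loss, toy$Insurance.Premium)

# Simple linear regression: premium as a function of loss
fit <- lm(Insurance.Premium ~ Insurance.Loss, data = toy)

r
coef(fit)   # intercept and slope (change in premium per unit of loss)
```

Because the study is observational, a positive slope here would describe an association across states, not a causal effect of losses on premiums.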