DATA 606 Final Project Proposal

On FiveThirtyEight, I found a dataset for figuring out which state has the "worst" drivers: https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv. It caught my attention because all of my relatives from other states laugh at me and say I am a bad driver since I am from NY (I don't think I am a bad driver at all). I am here to prove them wrong (hopefully). The article is here: https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/. I haven't read the article or looked at its graphs closely because I want to find the results myself.

# load data
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(psych)

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

url <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv')
head(url)

##        State Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## 1    Alabama                                                             18.8
## 2     Alaska                                                             18.1
## 3    Arizona                                                             18.6
## 4   Arkansas                                                             22.4
## 5 California                                                             12.0
## 6   Colorado                                                             13.6
##   Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## 1                                                                   39
## 2                                                                   41
## 3                                                                   35
## 4                                                                   18
## 5                                                                   35
## 6                                                                   37
##   Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## 1                                                                           30
## 2                                                                           25
## 3                                                                           28
## 4                                                                           26
## 5                                                                           28
## 6                                                                           28
##   Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## 1                                                                         96
## 2                                                                         90
## 3                                                                         84
## 4                                                                         94
## 5                                                                         91
## 6                                                                         79
##   Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## 1                                                                                                     80
## 2                                                                                                     94
## 3                                                                                                     96
## 4                                                                                                     95
## 5                                                                                                     89
## 6                                                                                                     95
##   Car.Insurance.Premiums....
## 1                     784.55
## 2                    1053.48
## 3                     899.47
## 4                     827.34
## 5                     878.41
## 6                     835.50
##   Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## 1                                                                       145.08
## 2                                                                       133.93
## 3                                                                       110.35
## 4                                                                       142.39
## 5                                                                       165.63
## 6                                                                       139.91

class(url)

## [1] "data.frame"

colnames(url)

## [1] "State"                                                                                                 
## [2] "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles"                                      
## [3] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding"                                  
## [4] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired"                          
## [5] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted"                            
## [6] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents"
## [7] "Car.Insurance.Premiums...."                                                                            
## [8] "Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...."
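
The column names above are long enough to make later code hard to read, so one option is to work with a renamed copy. This is a minimal sketch: the short names and the objects `bad_drivers`, `short_names`, and `drivers` are my own hypothetical choices, not part of the dataset, and their order is assumed to match the colnames() output above.

```r
# Re-read the raw file so this sketch runs on its own.
bad_drivers <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")

# Hypothetical short names, listed in the same order as colnames() above.
short_names <- c(
  "state",
  "fatal_per_billion_miles",
  "pct_speeding",
  "pct_alcohol",
  "pct_not_distracted",
  "pct_no_previous",
  "premium",
  "losses"
)

# setNames() returns a renamed copy; the original data frame is unchanged.
drivers <- setNames(bad_drivers, short_names)
str(drivers)
```

The rest of this proposal keeps using the original names, so this copy is optional.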

mean(url$Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired)

## [1] 30.68627

Averaged across states (an unweighted mean of the state-level percentages), about 31% of drivers involved in fatal collisions were alcohol-impaired.

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Does driver behavior in a state predict that state's car insurance premiums? Specifically, do states with more drivers involved in fatal collisions per billion miles have higher premiums?

Cases

What are the cases, and how many are there? There are 51 cases: one for each of the 50 states plus the District of Columbia.

Data collection

Describe the method of data collection. I obtained the dataset from FiveThirtyEight's GitHub repository, so I only need to import the raw CSV file.

Type of study

What type of study is this (observational/experiment)? This is an observational study: the data summarize collisions and insurance figures that were recorded as they occurred, with no treatment assigned.

Data Source

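If you collected the data, state self-collected. If not, provide a citation/link. FiveThirtyEight article: https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/. Dataset: https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv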

# use the full column name rather than relying on partial matching with $
describe(url$Car.Insurance.Premiums....)

##    vars  n   mean    sd median trimmed    mad    min     max  range skew
## X1    1 51 886.96 178.3 858.97  870.51 187.83 641.96 1301.52 659.56 0.73
##    kurtosis    se
## X1    -0.43 24.97

describe(url$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)

##    vars  n  mean   sd median trimmed mad min  max range skew kurtosis   se
## X1    1 51 15.79 4.12   15.6   15.73 4.3 5.9 23.9    18 0.04    -0.53 0.58

# refer to the column by name inside aes(); 10 bins is a judgment call for 51 observations
ggplot(url, aes(x = Car.Insurance.Premiums....)) + geom_histogram(bins = 10)

# refer to the column by name inside aes(); 10 bins is a judgment call for 51 observations
ggplot(url, aes(x = Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)) + geom_histogram(bins = 10)

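# scatterplot of the two variables, with axis labels that state what is measured
plot(x = url$Car.Insurance.Premiums....,
   y = url$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles,
   xlab = "Car insurance premium ($)",
   ylab = "Drivers in fatal collisions per billion miles",
   main = "Car Insurance Vs. Collisions"
   )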

The scatterplot above shows no clear linear relationship between car insurance premiums and the fatal-collision rate.
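
To put a number on that visual impression, the correlation coefficient could be computed and a least-squares line overlaid on the points. This is a rough sketch, not a final analysis: the names `bad_drivers`, `collisions`, `premium`, and `r` are hypothetical, and the columns are picked by position (2 and 7) on the assumption that they match the colnames() output earlier.

```r
library(ggplot2)

# Re-read the raw file so this sketch runs on its own.
bad_drivers <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")

# Column 2: drivers in fatal collisions per billion miles.
# Column 7: car insurance premiums in dollars.
collisions <- bad_drivers[[2]]
premium <- bad_drivers[[7]]

# Pearson correlation; a value near 0 would support the "no linearity" reading.
r <- cor(collisions, premium)
r

# Scatterplot with a least-squares line, premium on the y-axis as the response.
ggplot(data.frame(collisions, premium), aes(x = collisions, y = premium)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Drivers in fatal collisions per billion miles",
    y = "Car insurance premium ($)"
  )
```

Putting the premium on the y-axis matches the research question, where the premium is the outcome to be predicted.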

Questions/Side Thoughts:

Since I am planning a linear regression, can I follow Lab 8's structure? This is my first statistical analysis without any hints or suggestions to lean on, so I feel a bit intimidated.
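
A possible first pass is sketched below. It assumes Lab 8 follows the usual simple-regression workflow (fit a least-squares line, then check the residuals), which I have not confirmed. The names `bad_drivers`, `model_data`, and `fit` are hypothetical, and the columns are again chosen by position.

```r
# Re-read the raw file so this sketch runs on its own.
bad_drivers <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")

# Keep only the two variables in the research question, with short names.
model_data <- data.frame(
  collisions = bad_drivers[[2]],  # drivers in fatal collisions per billion miles
  premium = bad_drivers[[7]]      # car insurance premium in dollars
)

# Simple linear regression: premium as the response, collision rate as the predictor.
fit <- lm(premium ~ collisions, data = model_data)
summary(fit)

# Residual checks: constant spread around zero, and approximate normality.
par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit),
     xlab = "Fitted premium ($)", ylab = "Residual ($)")
abline(h = 0, lty = 2)
qqnorm(resid(fit))
qqline(resid(fit))
par(mfrow = c(1, 1))
```

If the slope is not distinguishable from zero, the other behavior columns (speeding, alcohol, distraction) could be tried as predictors in the same way.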