On FiveThirtyEight, I found a dataset for figuring out which state has the "worst" drivers: https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv. This caught my attention because all of my relatives from other states laugh at me and say I am a bad driver because I am from NY (I don't think I am a bad driver at all). I am here to prove them wrong (hopefully). The article is here: https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/. I haven't read the article or looked at the graphs closely, because I want to find the results myself.
# load data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
url <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv')
head(url)
## State Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## 1 Alabama 18.8
## 2 Alaska 18.1
## 3 Arizona 18.6
## 4 Arkansas 22.4
## 5 California 12.0
## 6 Colorado 13.6
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## 1 39
## 2 41
## 3 35
## 4 18
## 5 35
## 6 37
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## 1 30
## 2 25
## 3 28
## 4 26
## 5 28
## 6 28
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## 1 96
## 2 90
## 3 84
## 4 94
## 5 91
## 6 79
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## 1 80
## 2 94
## 3 96
## 4 95
## 5 89
## 6 95
## Car.Insurance.Premiums....
## 1 784.55
## 2 1053.48
## 3 899.47
## 4 827.34
## 5 878.41
## 6 835.50
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## 1 145.08
## 2 133.93
## 3 110.35
## 4 142.39
## 5 165.63
## 6 139.91
class(url)
## [1] "data.frame"
colnames(url)
## [1] "State"
## [2] "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles"
## [3] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding"
## [4] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired"
## [5] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted"
## [6] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents"
## [7] "Car.Insurance.Premiums...."
## [8] "Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...."
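These column names are long, and the last two end in four dots because `read.csv` replaced the characters in " ($)" with dots. `url$Car.Insurance.Premiums` still works later on only because `$` partially matches column names. A minimal sketch of renaming the columns, rebuilt from the first six rows printed by `head()` above so it runs without a download; the short names `fatal_rate` and `premium` are my own placeholders, not names from the dataset.

```r
# Three columns for the first six rows, copied from the head() output above.
drivers <- data.frame(
  c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  c(18.8, 18.1, 18.6, 22.4, 12.0, 13.6),
  c(784.55, 1053.48, 899.47, 827.34, 878.41, 835.50)
)
names(drivers) <- c(
  "State",
  "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles",
  "Car.Insurance.Premiums...."
)

# Lookup from original name to short name (the short names are hypothetical).
short_names <- c(
  "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles" = "fatal_rate",
  "Car.Insurance.Premiums...." = "premium"
)

# Rename only the columns that appear in the lookup; leave the rest alone.
hit <- names(drivers) %in% names(short_names)
names(drivers)[hit] <- unname(short_names[names(drivers)[hit]])

print(names(drivers))
```

With short, exact names there is no need to rely on partial matching, and the same names can be used inside `aes()` in ggplot2.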
mean(url$Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired)
## [1] 30.68627
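On average across the 51 rows, about 31% of drivers involved in fatal collisions were alcohol-impaired. This is an unweighted mean of the state-level percentages, so every state counts equally regardless of its size.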
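The same kind of average can be taken over every percentage column at once. A minimal sketch, using only the six rows printed by `head()` above, so the numbers it prints are for those six states and will not match the 51-row mean of 30.69; the object names `pct` and `pct_means` are my own.

```r
# Percentage columns for the first six rows, copied from the head() output above.
pct <- data.frame(
  c(39, 41, 35, 18, 35, 37),
  c(30, 25, 28, 26, 28, 28),
  c(96, 90, 84, 94, 91, 79),
  c(80, 94, 96, 95, 89, 95)
)
names(pct) <- c(
  "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding",
  "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired",
  "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted",
  "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents"
)

# Select every column whose name starts with "Percentage" and average each one.
is_pct <- grepl("^Percentage", names(pct))
pct_means <- colMeans(pct[, is_pct, drop = FALSE])

print(round(pct_means, 2))
```

On the full data, the same two lines at the end should work with `url` in place of `pct`.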
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Does driver behavior in a state predict its car insurance premiums? That is, do states with worse drivers (for example, more drivers involved in fatal collisions per billion miles) have higher car insurance premiums?
What are the cases, and how many are there? There are 51 cases: the 50 states plus the District of Columbia.
Describe the method of data collection. I found the dataset in FiveThirtyEight's GitHub repository, so I only need to import the raw CSV file.
What type of study is this (observational/experiment)? This is an observational study: the data record collisions and insurance figures as they occurred, with no treatment assigned.
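If you collected the data, state self-collected. If not, provide a citation/link. FiveThirtyEight article (https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/) and dataset (https://github.com/fivethirtyeight/data/blob/master/bad-drivers/bad-drivers.csv)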
describe(url$Car.Insurance.Premiums)
## vars n mean sd median trimmed mad min max range skew
## X1 1 51 886.96 178.3 858.97 870.51 187.83 641.96 1301.52 659.56 0.73
## kurtosis se
## X1 -0.43 24.97
describe(url$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 51 15.79 4.12 15.6 15.73 4.3 5.9 23.9 18 0.04 -0.53 0.58
# Use the full column name inside aes() (no url$) and set the bin width explicitly
ggplot(url, aes(x = Car.Insurance.Premiums....)) + geom_histogram(binwidth = 50)
# Use the full column name inside aes() (no url$) and set the bin width explicitly
ggplot(url, aes(x = Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)) + geom_histogram(binwidth = 2)
plot(x = url$Car.Insurance.Premiums, y = url$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles,
xlab = "Car insurance premium ($)",
ylab = "Drivers in fatal collisions per billion miles",
main = "Car Insurance Premiums vs. Fatal Collisions"
)
The scatterplot above shows no clear linear relationship between car insurance premiums and the fatal collision rate.
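One way to put a number on that visual impression is the correlation coefficient, followed by a simple linear model with the premium as the response, as the research question suggests. A minimal sketch, using only the six rows printed by `head()` above, so whatever it prints is illustrative and is not the result for all 51 rows; the names `six_states`, `premium`, `fatal_rate`, `r`, and `fit` are my own.

```r
# Premiums ($) and fatal collision rates for the first six rows of head() above.
six_states <- data.frame(
  premium    = c(784.55, 1053.48, 899.47, 827.34, 878.41, 835.50),
  fatal_rate = c(18.8, 18.1, 18.6, 22.4, 12.0, 13.6)
)

# Pearson correlation: values near 0 mean little linear association.
r <- cor(six_states$fatal_rate, six_states$premium)

# Simple linear regression of premium on the fatal collision rate.
fit <- lm(premium ~ fatal_rate, data = six_states)

print(r)
print(coef(fit))               # intercept and slope
print(summary(fit)$r.squared)  # equals r^2 for a one-predictor model
```

On the full data, the same calls should work with the two columns of `url`. A residual plot such as `plot(fitted(fit), resid(fit))` is one common way to check the linearity condition before trusting the fitted line.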
Since I am planning a linear regression, can I follow Lab 8's structure? This is my first statistical analysis without any hints or suggestions to lean on, so I feel a bit intimidated.