# Import the data, which was scraped from FiveThirtyEight (the bad-drivers data set)
data <- read.csv("https://raw.githubusercontent.com/datasets/five-thirty-eight-datasets/master/datasets/bad-drivers/data/bad-drivers.csv")
library(tibble)
glimpse(data)
## Rows: 51
## Columns: 8
## $ state <chr> …
## $ number.of.drivers.involved.in.fatal.collisions.per.billion.miles <dbl> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.were.speeding <int> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.were.alcohol.impaired <int> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.were.not.distracted <int> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.had.not.been.involved.in.any.previous.accidents <int> …
## $ car.insurance.premiums.... <dbl> …
## $ losses.incurred.by.insurance.companies.for.collisions.per.insured.driver.... <dbl> …
# Ideally we would clean the data and build a data frame holding only the variables we want to examine. For the purpose of this discussion board assignment, I will skip those steps and fit a model directly.
data.lm <- lm(car.insurance.premiums.... ~ number.of.drivers.involved.in.fatal.collisions.per.billion.miles, data = data) # the variable to the right of ~ (the predictor) is used to explain the variable to the left (the response)
data.lm
##
## Call:
## lm(formula = car.insurance.premiums.... ~ number.of.drivers.involved.in.fatal.collisions.per.billion.miles,
## data = data)
##
## Coefficients:
## (Intercept)
## 1023.354
## number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## -8.638
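I wanted to see whether the number of drivers involved in fatal collisions per billion miles has a strong relationship with, or helps explain, the cost of car insurance premiums. Based on the model, predicted car insurance premiums = 1023.35 - 8.64 * (number of drivers involved in fatal collisions per billion miles). The negative slope means the model predicts premiums about 8.64 lower for each additional driver involved in fatal collisions per billion miles.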
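As a quick worked example of that equation, the sketch below plugs a few collision rates into the fitted line. The helper name `predict_premium` and the input rates (10, 15, 20) are hypothetical, chosen only for illustration; the coefficients are the rounded values printed in the model output above.

```r
# Coefficients copied from the printed model output (rounded as shown there).
intercept <- 1023.354
slope     <- -8.638

# Hypothetical helper: predicted premium for a given rate of drivers
# involved in fatal collisions per billion miles.
predict_premium <- function(rate) {
  intercept + slope * rate
}

predict_premium(15)         # 1023.354 - 8.638 * 15
predict_premium(c(10, 20))  # vectorised over several rates
```

With the fitted object itself, `predict(data.lm, newdata = ...)` should give the same kind of result, provided the `newdata` column carries the same long predictor name used in the formula.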
#plot the original data and model
plot(car.insurance.premiums.... ~ number.of.drivers.involved.in.fatal.collisions.per.billion.miles, data = data) # from the scatter plot alone, there does not seem to be a clear pattern
abline(data.lm)
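The scatter plot gives only a visual impression of how strong the relationship is. One way to put a number on it is to read R-squared and the slope's p-value from `summary()`. The sketch below is a minimal illustration: `fit_strength` is a hypothetical helper, and the demo uses R's built-in `cars` data as a stand-in, not the bad-drivers data.

```r
# Hypothetical helper: pull two common strength-of-fit numbers out of a
# simple linear regression (one predictor, so the slope is row 2).
fit_strength <- function(model) {
  s <- summary(model)
  list(
    r_squared     = s$r.squared,                     # share of variance explained
    slope_p_value = s$coefficients[2, "Pr(>|t|)"]    # p-value for the slope
  )
}

# Stand-in example on the built-in `cars` data set.
demo.lm <- lm(dist ~ speed, data = cars)
fit_strength(demo.lm)
```

Calling `fit_strength(data.lm)` would report the same two numbers for the premiums model; an R-squared near zero would back up the impression that there is no clear pattern.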
residuals <- resid(data.lm) # differences between the observed and predicted values
residuals # 51 unlabeled numbers are hard to read; this is why you should make a data frame!
## 1 2 3 4 5 6
## -76.408771 186.474585 36.783617 -2.521744 -41.287600 -70.376699
## 7 8 9 10 11 12
## 138.666724 254.452265 301.500215 291.396972 24.549427 -11.008253
## 13 14 15 16 17 18
## -249.231992 -109.677150 -187.642442 -238.676766 -89.146834 34.010193
## 19 20 21 22 23 24
## 435.275936 -231.039604 133.401432 58.617760 209.052333 -163.248951
## 25 26 27 28 29 30
## 24.745553 -93.961541 -22.289807 -162.367217 133.495170 -176.612825
## 31 32 33 34 35 36
## 374.911950 5.436004 317.203819 -169.994897 -128.154649 -203.827667
## 37 38 39 40 41 42
## 30.053099 -108.077150 39.848391 221.518143 42.065351 -186.465933
## 43 44 45 46 47 48
## -87.002127 148.974067 -116.364244 -189.676699 -144.700956 -41.760888
## 49 50 51
## 174.841545 -233.839086 -81.912059
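The raw printout above is hard to tie back to individual states. A data frame that pairs each residual with a label is easier to read. The sketch below is one possible way to do that: `residual_table` is a hypothetical helper, and the demo again uses the built-in `cars` data as a stand-in.

```r
# Hypothetical helper: one row per observation, with its label, fitted value
# and residual, sorted so the largest misses come first.
residual_table <- function(model, labels) {
  out <- data.frame(
    label    = labels,
    fitted   = fitted(model),
    residual = resid(model)
  )
  out[order(-abs(out$residual)), ]
}

# Stand-in example on the built-in `cars` data set, labelled by row number.
demo.lm <- lm(dist ~ speed, data = cars)
head(residual_table(demo.lm, labels = seq_len(nrow(cars))))
```

For the model in this post, `residual_table(data.lm, data$state)` would show which states sit furthest from the fitted line, assuming no rows were dropped when fitting (the 51 residuals above match the 51 rows of `data`).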
# Plot residuals against predicted values
plot(data.lm$fitted.values, residuals, xlab = "Fitted values", ylab = "Residuals", main = "Residual Plot")
abline(h = 0, col = "red", lty = 2) # Add a horizontal line at y = 0
# Plot a histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")
# Plot a density estimate of the residuals (a visual check of normality, not a formal test)
plot(density(residuals), main = "Density Plot of Residuals", xlab = "Residuals")
# Plot a Q-Q plot
qqnorm(residuals)
qqline(residuals)
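The histogram, density plot and Q-Q plot are all visual checks. A formal test such as Shapiro-Wilk (`shapiro.test()` in base R's stats package) can complement them. The sketch below runs the test on simulated right-skewed values used purely as a stand-in; `demo_resid` is a hypothetical name and is not the residual vector from this model.

```r
# Simulated, right-skewed stand-in values (NOT the residuals from data.lm).
set.seed(1)
demo_resid <- rexp(51) - 1

# Shapiro-Wilk test: the null hypothesis is that the values come from a
# normal distribution, so a small p-value is evidence against normality.
sw <- shapiro.test(demo_resid)
sw$statistic
sw$p.value
```

Running `shapiro.test(residuals)` on the vector created earlier would apply the same test to this model's residuals.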
##Conclusion: I believe a linear model is not appropriate for this data set and/or the variables I selected. The residual plot is not symmetric above and below zero, the histogram and the density plot show that the residuals are not normally distributed and skew to the right, and the Q-Q plot shows too many points that do not fall on the straight line.