Data 605

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

# Import the bad-drivers data set from FiveThirtyEight (hosted on GitHub)
data <- read.csv("https://raw.githubusercontent.com/datasets/five-thirty-eight-datasets/master/datasets/bad-drivers/data/bad-drivers.csv")
library(tibble)  # provides glimpse()
glimpse(data)
## Rows: 51
## Columns: 8
## $ state                                                                                                  <chr> …
## $ number.of.drivers.involved.in.fatal.collisions.per.billion.miles                                       <dbl> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.were.speeding                                   <int> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.were.alcohol.impaired                           <int> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.were.not.distracted                             <int> …
## $ percentage.of.drivers.involved.in.fatal.collisions.who.had.not.been.involved.in.any.previous.accidents <int> …
## $ car.insurance.premiums....                                                                             <dbl> …
## $ losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....                           <dbl> …
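The column names shown by glimpse() are long enough to make every later call hard to read. A light cleaning step could shorten them. The sketch below is only an illustration: the short names (`fatal_per_bn_miles`, `premium`, `losses`) and the helper `shorten_names()` are hypothetical and are not used elsewhere in this post, and the toy rows hold made-up values so the block runs on its own.

```r
# Map short names (hypothetical, chosen for this sketch) to the column
# names that read.csv() produced, as listed in the glimpse() output.
name_map <- c(
  fatal_per_bn_miles = "number.of.drivers.involved.in.fatal.collisions.per.billion.miles",
  premium            = "car.insurance.premiums....",
  losses             = "losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...."
)

# Keep `state` plus the mapped columns and give them the short names.
shorten_names <- function(df, map = name_map) {
  out <- df[, c("state", unname(map)), drop = FALSE]
  names(out) <- c("state", names(map))
  out
}

# Toy rows with made-up values, only to show that the function runs.
toy <- data.frame(state = c("A", "B"), x = c(10, 20), y = c(800, 900), z = c(100, 150))
names(toy) <- c("state", unname(name_map))

drivers <- shorten_names(toy)
names(drivers)
```

With the real data this would be `drivers <- shorten_names(data)`, after which a model formula could read `premium ~ fatal_per_bn_miles`.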
# Ideally we would clean the data and build a data frame of only the variables we want to examine.
# For the purpose of this discussion board assignment, I will skip those steps and fit the model directly.
data.lm <- lm(car.insurance.premiums.... ~ number.of.drivers.involved.in.fatal.collisions.per.billion.miles, data = data)  # formula is response ~ predictor: the variable on the right explains the one on the left
data.lm
## 
## Call:
## lm(formula = car.insurance.premiums.... ~ number.of.drivers.involved.in.fatal.collisions.per.billion.miles, 
##     data = data)
## 
## Coefficients:
##                                                      (Intercept)  
##                                                         1023.354  
## number.of.drivers.involved.in.fatal.collisions.per.billion.miles  
##                                                           -8.638

I wanted to see whether the number of drivers involved in fatal collisions (per billion miles) has a strong relationship with, or helps explain, the cost of car insurance premiums. Based on the model, the predicted car insurance premium equals 1023.35 - 8.64 × (number of drivers involved in fatal collisions per billion miles).
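To make the fitted equation concrete, here is a small worked example using the coefficients printed above. The helper `predict_premium()` is hypothetical, and the inputs 10, 15, and 20 are arbitrary illustrative values, not rows from the data.

```r
# Coefficients copied from the printed model output.
intercept <- 1023.354
slope     <- -8.638

# Hypothetical helper: predicted premium for a given number of drivers
# involved in fatal collisions per billion miles.
predict_premium <- function(fatal_per_bn_miles) {
  intercept + slope * fatal_per_bn_miles
}

# Arbitrary illustrative inputs.
predict_premium(c(10, 15, 20))
# 936.974 893.784 850.594
```

Each additional driver involved in a fatal collision per billion miles lowers the predicted premium by about 8.64. With the fitted model in memory, `predict(data.lm, newdata = ...)` would give the same numbers without retyping the coefficients.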

# Plot the original data and the fitted model
plot(car.insurance.premiums.... ~ number.of.drivers.involved.in.fatal.collisions.per.billion.miles, data = data)  # the scatter plot alone shows no clear pattern
abline(data.lm)
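A scatter plot gives only a visual impression of how strong the relationship is. One way to put a number on it is the correlation between the two variables, together with the R-squared of the fit (for a one-predictor model, R-squared is the squared correlation). The sketch below uses R's built-in `cars` data as a stand-in so that it runs without a download; that substitution is an assumption of this example, and the values it prints say nothing about the bad-drivers data.

```r
# Stand-in model on the built-in `cars` data (NOT the bad-drivers data).
fit <- lm(dist ~ speed, data = cars)

# Pearson correlation between predictor and response.
r <- cor(cars$speed, cars$dist)

# R-squared reported by the model summary.
r2 <- summary(fit)$r.squared

c(correlation = r, r_squared = r2)

# p-value for the null hypothesis of zero correlation.
cor.test(cars$speed, cars$dist)$p.value
```

For this post the equivalent calls would be `summary(data.lm)$r.squared` and `cor.test()` on the two columns used in the model.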

residuals <- resid(data.lm)  # differences between the observed and predicted values
residuals  # unlabeled output like this is why you should make a data frame!
##           1           2           3           4           5           6 
##  -76.408771  186.474585   36.783617   -2.521744  -41.287600  -70.376699 
##           7           8           9          10          11          12 
##  138.666724  254.452265  301.500215  291.396972   24.549427  -11.008253 
##          13          14          15          16          17          18 
## -249.231992 -109.677150 -187.642442 -238.676766  -89.146834   34.010193 
##          19          20          21          22          23          24 
##  435.275936 -231.039604  133.401432   58.617760  209.052333 -163.248951 
##          25          26          27          28          29          30 
##   24.745553  -93.961541  -22.289807 -162.367217  133.495170 -176.612825 
##          31          32          33          34          35          36 
##  374.911950    5.436004  317.203819 -169.994897 -128.154649 -203.827667 
##          37          38          39          40          41          42 
##   30.053099 -108.077150   39.848391  221.518143   42.065351 -186.465933 
##          43          44          45          46          47          48 
##  -87.002127  148.974067 -116.364244 -189.676699 -144.700956  -41.760888 
##          49          50          51 
##  174.841545 -233.839086  -81.912059
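The printed residuals are hard to read because they are labeled only by row number. A minimal sketch of the data frame approach follows: it pairs each residual with a label and its fitted value, then sorts by absolute size so the worst-fit observations come first. The helper `residual_table()` is hypothetical, the built-in `cars` data is a stand-in so the block runs on its own, and the sketch assumes the model was fit with no rows dropped for missing values.

```r
# Hypothetical helper: one row per observation, largest |residual| first.
residual_table <- function(model, labels) {
  out <- data.frame(
    label    = labels,
    fitted   = unname(fitted(model)),
    residual = unname(resid(model))
  )
  out[order(-abs(out$residual)), ]
}

# Stand-in model on the built-in `cars` data (NOT the bad-drivers data).
fit <- lm(dist ~ speed, data = cars)
tab <- residual_table(fit, labels = rownames(cars))
head(tab, 5)
```

For this post the call would be `residual_table(data.lm, data$state)`, which would show which states sit furthest from the fitted line.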
# Plot residuals against predicted values
plot(data.lm$fitted.values, residuals, xlab = "Fitted values", ylab = "Residuals", main = "Residual Plot")
abline(h = 0, col = "red", lty = 2)  # Add a horizontal line at y = 0

# Plot a histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

# Plot a density estimate of the residuals (a visual check of normality, not a formal test)
plot(density(residuals), main = "Density Plot of Residuals", xlab = "Residuals")

# Plot a Q-Q plot
qqnorm(residuals)
qqline(residuals)
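The histogram, density plot, and Q-Q plot are all visual checks. A formal test can complement them; the Shapiro-Wilk test (`shapiro.test()` in base R) is one common choice, where a small p-value is evidence against normally distributed residuals. The sketch below again uses the built-in `cars` data as a stand-in so that it runs on its own, so its result does not describe the bad-drivers model.

```r
# Stand-in model on the built-in `cars` data (NOT the bad-drivers data).
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test on the residuals.
# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(resid(fit))

sw$statistic  # W statistic; values near 1 are consistent with normality
sw$p.value    # a small p-value suggests the residuals are not normal
```

For this post the call would simply be `shapiro.test(residuals)`.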

##Conclusion: I believe the linear model is not appropriate for this data set and/or the variables I selected. The residual plot is not symmetric above and below zero; the histogram and the density plot show that the residuals are not normally distributed and are skewed to the right; and too many points on the Q-Q plot fall off the straight line.