Data 607 - Final Project

Proposal

My original proposal was to analyze build orders in a video game called Halo Wars 2, but in the interest of timing and completion, I decided against my original proposal and switch gears to determine which state has the worst drivers. As a driver myself, I often come across other drivers that just seem to be doing things wrongly. As frustrating as it can be on the road here in NY, I am led to believe that there must be other states with worse drivers. As a frequent traveler, I feel comfortable saying that I have seen some of the worst drivers in the world.

Data Acquisition

The bad drivers dataset can be procured directly from the FiveThirtEight GitHub repo. We will use readr’s read_csv function to parse the csv directly into a tibble.

# bad drivers data
badDrivers <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv") %>% 
    rename_all(funs(str_to_lower(.)))

## Parsed with column specification:
## cols(
##   State = col_character(),
##   `Number of drivers involved in fatal collisions per billion miles` = col_double(),
##   `Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding` = col_integer(),
##   `Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired` = col_integer(),
##   `Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted` = col_integer(),
##   `Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents` = col_integer(),
##   `Car Insurance Premiums ($)` = col_double(),
##   `Losses incurred by insurance companies for collisions per insured driver ($)` = col_double()
## )

The following code will retrive data from my MongoDB Atlas account through their Cloud API

# mongoURL
url <- "mongodb://bacurtin2:ih3VBt29@data607-shard-00-00-hbxuz.mongodb.net:27017,data607-shard-00-01-hbxuz.mongodb.net:27017,data607-shard-00-02-hbxuz.mongodb.net:27017/test?ssl=true&replicaSet=Data607-shard-0&authSource=admin"

# download
mDB <- mongo(collection = "regionMappings", db = "project", url = url)
regionMap <- mDB$find("{}")

Data Cleaning

After acquiring the data, it is often that it would need to be cleaned. Those processes will be described and performed below.

bdData <- badDrivers %>% # join badDrivers to regionMap
left_join(regionMap, by = "state") %>% # rename columns for easier reference
rename(driver_fatalities = "number of drivers involved in fatal collisions per billion miles", 
    speeding_percent = "percentage of drivers involved in fatal collisions who were speeding", 
    alcohol_percent = "percentage of drivers involved in fatal collisions who were alcohol-impaired", 
    not_distracted_percent = "percentage of drivers involved in fatal collisions who were not distracted", 
    no_prior_accident_percent = "percentage of drivers involved in fatal collisions who had not been involved in any previous accidents", 
    insurance_premiums = "car insurance premiums ($)", insuranceco_losses = "losses incurred by insurance companies for collisions per insured driver ($)") %>% 
    # reorder columns
select(1, 9:11, 2:8)

Data Analysis

Exploratory Analysis

It is generally accepted practice to “peek” at the structure of the data before going into full blown analysis.

str(bdData)

## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  11 variables:
##  $ state                    : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ state code               : chr  "AL" "AK" "AZ" "AR" ...
##  $ region                   : chr  "South" "West" "West" "South" ...
##  $ division                 : chr  "East South Central" "Pacific" "Mountain" "West South Central" ...
##  $ driver_fatalities        : num  18.8 18.1 18.6 22.4 12 13.6 10.8 16.2 5.9 17.9 ...
##  $ speeding_percent         : int  39 41 35 18 35 37 46 38 34 21 ...
##  $ alcohol_percent          : int  30 25 28 26 28 28 36 30 27 29 ...
##  $ not_distracted_percent   : int  96 90 84 94 91 79 87 87 100 92 ...
##  $ no_prior_accident_percent: int  80 94 96 95 89 95 82 99 100 94 ...
##  $ insurance_premiums       : num  785 1053 899 827 878 ...
##  $ insuranceco_losses       : num  145 134 110 142 166 ...

summary(bdData)

##     state            state code           region         
##  Length:51          Length:51          Length:51         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    division         driver_fatalities speeding_percent alcohol_percent
##  Length:51          Min.   : 5.90     Min.   :13.00    Min.   :16.00  
##  Class :character   1st Qu.:12.75     1st Qu.:23.00    1st Qu.:28.00  
##  Mode  :character   Median :15.60     Median :34.00    Median :30.00  
##                     Mean   :15.79     Mean   :31.73    Mean   :30.69  
##                     3rd Qu.:18.50     3rd Qu.:38.00    3rd Qu.:33.00  
##                     Max.   :23.90     Max.   :54.00    Max.   :44.00  
##  not_distracted_percent no_prior_accident_percent insurance_premiums
##  Min.   : 10.00         Min.   : 76.00            Min.   : 642.0    
##  1st Qu.: 83.00         1st Qu.: 83.50            1st Qu.: 768.4    
##  Median : 88.00         Median : 88.00            Median : 859.0    
##  Mean   : 85.92         Mean   : 88.73            Mean   : 887.0    
##  3rd Qu.: 95.00         3rd Qu.: 95.00            3rd Qu.:1007.9    
##  Max.   :100.00         Max.   :100.00            Max.   :1301.5    
##  insuranceco_losses
##  Min.   : 82.75    
##  1st Qu.:114.64    
##  Median :136.05    
##  Mean   :134.49    
##  3rd Qu.:151.87    
##  Max.   :194.78

Lets have a look at the distribution of driver fatalities

hist(bdData$driver_fatalities)

Using R’s base graphics, we can see that the distribution of fatalities is bi-modal, slightly skewed left, and follows somewhat of a normal distribution. The outlier appears to be Washington DC. We can retroactively remove DC from the dataset.

bdData2 <- bdData %>% filter(`state code` != "DC")

hist(bdData2$driver_fatalities)

After removing DC, we certainly have a more normal distribution despite the bi-modality. Lets have a look at insurance premiums.

hist(bdData2$insurance_premiums)

This graph is very interesting. It is most certainly skewed right and unimodal. With a basic understanding of the data now, we can perform real analysis.

Analysis

Since we have included some categorical variables in the dataset, we can do some analysis on the region that experiences the most fatalities.

ggplot(bdData2, aes(x = region, y = driver_fatalities, fill = region)) + geom_boxplot() + 
    labs(x = "Region", y = "Driver Fatalities", title = "Region vs. Driver Fatalities", 
        fill = "")

The side-by-side boxplots show a grim view of Southern drivers. The Southern region by far has the highest median, and its Q1 is higher than both the Midwest’s and the Northeast’s Q3 value, just from eyeballing the graph. We can also view this data in a scatterplot.

ggplot(bdData2, aes(x = driver_fatalities, y = insurance_premiums, col = region)) + 
    geom_jitter(alpha = 0.8) + geom_smooth(method = "lm") + facet_grid(. ~ region) + 
    labs(x = "Driver Fatalities", y = "Insurance Premiums", title = "Driver Fatalities vs. Insurance Premiums", 
        col = "")

This graph has some very interesting revelations. In three of the regions, there is a negative relationship between driver fatalities and insurance premiums. Only Western states experience increasing premiums with increasing driver fatalities. This graph also illuminates the regions with the highest premiums being the Northeast and the South. Northeast has higher premiums despite bring lower in driver fatalities.

Lets get a little more robust and create a linear model of driver fatalities.

model <- lm(driver_fatalities ~ region + speeding_percent + alcohol_percent + 
    not_distracted_percent + no_prior_accident_percent + insurance_premiums + 
    insuranceco_losses, data = bdData2)
summary(model)

## 
## Call:
## lm(formula = driver_fatalities ~ region + speeding_percent + 
##     alcohol_percent + not_distracted_percent + no_prior_accident_percent + 
##     insurance_premiums + insuranceco_losses, data = bdData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8763 -1.6193 -0.2416  1.9035  6.3984 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)               13.194687   9.143921   1.443  0.15681   
## regionNortheast           -2.874222   1.671314  -1.720  0.09321 . 
## regionSouth                4.743886   1.423845   3.332  0.00187 **
## regionWest                -0.146481   1.730290  -0.085  0.93296   
## speeding_percent           0.014744   0.059434   0.248  0.80534   
## alcohol_percent            0.155759   0.100925   1.543  0.13063   
## not_distracted_percent     0.029206   0.032186   0.907  0.36962   
## no_prior_accident_percent -0.003271   0.079731  -0.041  0.96748   
## insurance_premiums         0.001476   0.004124   0.358  0.72224   
## insuranceco_losses        -0.051594   0.030827  -1.674  0.10200   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.233 on 40 degrees of freedom
## Multiple R-squared:  0.4423, Adjusted R-squared:  0.3168 
## F-statistic: 3.525 on 9 and 40 DF,  p-value: 0.002697

Since we have an Adjusted R-squared of only .3168, we will attempt to remove some variables from the model in order to get a better predictor.

model <- lm(driver_fatalities ~ region + alcohol_percent + insuranceco_losses, 
    data = bdData2)
summary(model)

## 
## Call:
## lm(formula = driver_fatalities ~ region + alcohol_percent + insuranceco_losses, 
##     data = bdData2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.771 -1.864 -0.249  1.892  6.612 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        15.69391    4.14492   3.786  0.00046 ***
## regionNortheast    -2.51544    1.41280  -1.780  0.08191 .  
## regionSouth         4.68944    1.35275   3.467  0.00119 ** 
## regionWest          0.02784    1.26761   0.022  0.98258    
## alcohol_percent     0.17152    0.08822   1.944  0.05827 .  
## insuranceco_losses -0.04490    0.02178  -2.061  0.04519 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.128 on 44 degrees of freedom
## Multiple R-squared:  0.4258, Adjusted R-squared:  0.3605 
## F-statistic: 6.526 on 5 and 44 DF,  p-value: 0.0001279

The best adjusted R-squared we could get from this set of variables is .3605. This indicates that there are variables not in this dataset that influence driver fatalities. The other piece of information we can garner from the model is that regionSouth has the greatest impact on driver_fatalities. This would seem to indicate that Southern drivers are the worst drivers. Another interesting fact is that it would appear Northeast drivers are the best. Their negative coefficient indicates that being from the Northeast actually reduces the driver fatality figure.

The last few analyses will focus on speed, drinking, and distraction.

ggplot(bdData2, aes(x = speeding_percent)) + geom_histogram(binwidth = 4) + 
    facet_grid(. ~ region)

ggplot(bdData2, aes(x = alcohol_percent)) + geom_histogram(binwidth = 4) + facet_grid(. ~ 
    region)

ggplot(bdData2, aes(x = not_distracted_percent)) + geom_histogram(binwidth = 4) + 
    facet_grid(. ~ region)

All three of these graphs show that Southern drivers experience the most driver fatalities in all the different scenarios.

Conclusion

Based on statistical evidence, Southern drivers are the worst driver in the US. The list of Southern states include:

regionMap %>% filter(region == "South") %>% select(state)