Abstract

Every state has bad drivers, but some states seem to have a greater amount than others. The purpose of this study is to determine which region and state of the United States has the worst drivers. In order to determine this, a ranking system was created that ranked 5 different key variables: Losses Incurred by Insurance Companies, Average Auto Insurance Premium, Road Congestion, Fatalities per Licensed Driver, and Fatalities per Million Vehicle Miles Traveled. These key variables were taken from published reports by The National Highway Traffic Safety Administration (NHTSA) and The National Association of Insurance Commissioners (NAIC). In addition, an Analysis of Variance (ANOVA) test was performed that showed that there is a significant difference in the rankings by region. After the analysis was performed, it was determined that the West South Central region is the region with the worst drivers and Louisiana is the state with the worst drivers. These conclusions can also be found by looking at the data and numerous visualizations throughout this report.

Introduction

Which State Has the Worst Drivers?

In a survey done by the AAA Foundation, it was found that in 2016-2017, Americans spent on average, 51 minutes in the car every day. This totals to 310 hours behind the wheel annually. With all this time in the car, a motorist will likely experience a bad driver. Whether it is the congested interstates of California, the substantial aging population in Florida, or the aggressive drivers of New York, most Americans claim that their own state has the worst drivers. However, while every state has bad drivers, some have more in comparison to others.

Analyzing the Problem

This report looks at a variety of data sources to determine which state really has the worst drivers. To do this, I created a ranking system that ranks each of the key variables. After each category is ranked, the total rank is computed and the state with the highest rank indicating the worst drivers. I looked at this problem from both a region and a state prospective and ranked each using the ranking system. In addition, an ANOVA test was completed using the final weighted rank by region in order to determine if there was a significant difference in the rankings by region. Aside from the rankings and ANOVA, I also created various visualizations which help better understand some of the reasons behind these fatal crashes and the key variables.

Key Variables

Losses Incurred by Insurance Companies
Average Auto Insurance Premium
Road Congestion
Fatalities From Collisions per Licensed Driver
Fatalities From Collisions per Million Vehicle Miles Traveled

Previous Research on This Topic

In 2014, FiveThirtyEight released an article that analyzed which state had the worst drivers. The data used in the analysis is from 2009-2012. Instead of doing my own analysis using this data, I choose to find and analyze more recent data from 2017-2018. Finding new data allows for the results to be more relevant to today’s society.

Data

Packages

The following R libraries are required in order to run the code used in this project. In addition, all of the code for this project featured throughout can be viewed using the “code” tab.

#Creating Package Table
Packages <-data.table(Package=c("rmdformats","DT","readxl","tidyverse","data.table","RColorBrewer","dplyr","rstatix","ggpubr","reshape","usmaps","ggtheme"),
                      Use=c("R Markdown/HTML Template","Displaying Data Tables","Reading Excel Data into R","Data Manipulation","Displaying Data Tables",
                                "Color Palettes for Graphs","Data Manipulation","Statistical Tests","Statistical Tests","Data Manipluation","Template for creating population maps",
                                "Themes for graphs and maps"))

knitr::kable(Packages)

Package	Use
rmdformats	R Markdown/HTML Template
DT	Displaying Data Tables
readxl	Reading Excel Data into R
tidyverse	Data Manipulation
data.table	Displaying Data Tables
RColorBrewer	Color Palettes for Graphs
dplyr	Data Manipulation
rstatix	Statistical Tests
ggpubr	Statistical Tests
reshape	Data Manipluation
usmaps	Template for creating population maps
ggtheme	Themes for graphs and maps

Cleaning

Clean Data

In order to prepare for analysis, the following actions were performed on the data:

Shortened Column Names
Created New Column of Fatalities per 100,000 Licensed Drivers
Rounded Data
Filled in Missing Values(2)
Split States into Regions

#Importing Data
DriverData <- read_xlsx("Project Data.xlsx",col_names=TRUE,na="NA")

#Renaming columns for easier use 
CleanData<-
  rename(DriverData,
    Fatality_Rate= `Fatality Rate per Millon VMT`,
    Drivers= `Drivers Involved in Fatal Crashes`,
    Speeding_Fatalities =`Speeding-Related Fatalities`,
    Alcohol_Fatalities =`Alcohol-Impaired Driving Fatalities (BAC=.08+)`,
    Distracted_Fatalities =`Distracted Driving Fatalities`,
    Auto_Premium =`Average Car Insurance Premium`,
    Insurance_Losses =`Losses incurred by insurance companies for collisions per insured driver`,
    Licensed_Drivers=`Licensed Drivers`,
    Traffic_Congestion=`Millions of Miles Driven per Mile of Roadway`
  )
#Adding Fatalities_Per_Driver column by dividing Fatalities by (Licensed_Drivers/100000)
CleanData <- transform(CleanData, Fatalities_Per_Driver= c(Fatalities/(Licensed_Drivers/100000)))

#Rounding Fatalities_Per_Driver to 6 places
CleanData <- mutate_at(CleanData,vars(Fatalities_Per_Driver), funs(round(., 6)))

Estimating Missing Values

There are 2 missing values in the data

Drivers who were distracted and in fatal crashes in Rhode Island
Losses incurred by insurance companies for collisions per insured driver in Texas

For better analysis, the values were filled in using an average common ratio between variables. This was the best method given the state’s total population. With Rhode Island being one of the smallest states and Texas being one of the largest states, using just the mean of each category for the missing value would have been an inadequate solution.

Rhode Island

mean_Fatalities <- mean(CleanData$Drivers)
mean_distracted <- mean(CleanData$Distracted_Fatalities, na.rm=TRUE)
#Average ratio
Ratio1 <- mean_distracted/mean_Fatalities
#Computing missing value
RhodeDistracted<- CleanData$Fatalities[40]*Ratio1
RhodeDistracted<- round(RhodeDistracted,digits=0)
#Filling missing value
CleanData$Distracted_Fatalities[40]<-RhodeDistracted

Common Ratio between distracted drivers in fatal collisions and total drivers in fatal collisions: 0.0528227

Missing Value: 3

Texas

mean_Premium <- mean(CleanData$Auto_Premium)
mean_Loss <- mean(CleanData$Insurance_Losses, na.rm=TRUE)
#Average ratio
Ratio2 <- mean_Loss/mean_Premium
#Computing missing value
TexasLoss<- CleanData$Auto_Premium[44]*Ratio2
TexasLoss<- round(TexasLoss,digits=2)
#Filling missing value
CleanData$Insurance_Losses[44]<-TexasLoss

Common Ratio between Insurance Loss and Auto Premium: 0.2040097

Missing Value: 279.95

Regions

The states are grouped into 9 different divisions that are specified by The US Census.

Source: US Census

# Splitting all the states into regions
CleanData$Region <- ifelse(CleanData$State %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"), "East North Central", 

ifelse(CleanData$State %in% c("Alabama", "Kentucky", "Mississippi", "Tennessee"), "East South Central",   
       
ifelse(CleanData$State %in% c("New Jersey", "New York", "Pennsylvania"), "Middle Atlantic",
       
ifelse(CleanData$State %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming"), "Mountain",

ifelse(CleanData$State %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"), "New England",
                                                      
ifelse(CleanData$State %in% c("Alaska", "California", "Hawaii", "Oregon", "Washington"), "Pacific",

ifelse(CleanData$State %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "Dist of Columbia", "West Virginia"), "South Atlantic",
                                                                    
ifelse(CleanData$State %in% c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"), "West North Central",
                                                                           
ifelse(CleanData$State %in% c("Arkansas", "Louisiana", "Oklahoma", "Texas"), "West South Central", "NA")))))))))

Description of Data

Obtaining Data

Most of the data analyzed in the project was originally published by The National Highway Traffic Safety Administration or The National Association of Insurance Commissioners.

National Highway Traffic Safety Administration
- Every year, the NHTSA releases a report called Traffic Safety Facts: A Compilation of Motor Vehicle Crash Data. This report presents descriptive statistics about traffic crashes of all severities. The data comes from 3 main sources: The Fatality Analysis Reporting System (FARS), The National Automotive Sampling System General Estimates System (NASS GES), and The New Crash Report Sampling System (CRSS). The data used in this project comes from several of Traffic Safety Facts Annual Report Tables that were featured in the current report.
National Association of Insurance Commissioners
- The NAIC produced a report called Auto Insurance Database Report. The most recent report, published February 2021, included data from 2017-2018. This report provides average costs associated with personal auto insurance nationwide. It also includes countrywide and state-specific data, including earned premiums, incurred losses, earned exposures and number of claims for both voluntary and residual market business.

Variables

All data is annual.

#Creating table to display description of variables and their source
DesTable <- data.table(Variables=colnames(CleanData), 
                       Description=c('50 US States and District of Columbia', 'Fatalities From Fatal Collisions',
                                     'Fatalities per Million Vehicle Miles Traveled',
                                     'Drivers Involved in Fatal Collisions',
                                     'Drivers Involved in Fatal Collisions Who Were Speeding',
                                     'Drivers Involved in Fatal Collisions Who Were Alcohol-Impaired',
                                     'Drivers Involved in Fatal Collisions Who Were Distracted',
                                     'Average Auto Insurance Premium',
                                     'Losses Incurred by Insurance Companies for Collisions per Insured Driver',
                                     'Number of Licensed Drivers',
                                     'Millions of Miles Driven per Mile of Roadway',
                                     'Fatalities per 100,000 Licensed Drivers',
                                     'Region of the US'), 
                       Source=c('N/A','National Highway Traffic Safety Administration, 2018',
                                'National Highway Traffic Safety Administration, 2018',
                                'National Highway Traffic Safety Administration, 2018',
                                'National Highway Traffic Safety Administration, 2018',
                                'National Highway Traffic Safety Administration, 2018',
                                'National Highway Traffic Safety Administration, 2018',
                                'National Association of Insurance Commissioners, 2018',
                                'National Association of Insurance Commissioners, 2017',
                                'National Highway Traffic Safety Administration, 2018',
                                'National Association of Insurance Commissioners, 2017',
                                'Calculated','U.S. Census Bureau'))
#Displaying Table
knitr::kable(DesTable)

Variables	Description	Source
State	50 US States and District of Columbia	N/A
Fatalities	Fatalities From Fatal Collisions	National Highway Traffic Safety Administration, 2018
Fatality_Rate	Fatalities per Million Vehicle Miles Traveled	National Highway Traffic Safety Administration, 2018
Drivers	Drivers Involved in Fatal Collisions	National Highway Traffic Safety Administration, 2018
Speeding_Fatalities	Drivers Involved in Fatal Collisions Who Were Speeding	National Highway Traffic Safety Administration, 2018
Alcohol_Fatalities	Drivers Involved in Fatal Collisions Who Were Alcohol-Impaired	National Highway Traffic Safety Administration, 2018
Distracted_Fatalities	Drivers Involved in Fatal Collisions Who Were Distracted	National Highway Traffic Safety Administration, 2018
Auto_Premium	Average Auto Insurance Premium	National Association of Insurance Commissioners, 2018
Insurance_Losses	Losses Incurred by Insurance Companies for Collisions per Insured Driver	National Association of Insurance Commissioners, 2017
Licensed_Drivers	Number of Licensed Drivers	National Highway Traffic Safety Administration, 2018
Traffic_Congestion	Millions of Miles Driven per Mile of Roadway	National Association of Insurance Commissioners, 2017
Fatalities_Per_Driver	Fatalities per 100,000 Licensed Drivers	Calculated
Region	Region of the US	U.S. Census Bureau

Interactive Datatable

#Interactive Table
datatable(CleanData, options = list(pageLength = 10),filter = 'top')

Summary Statistics

#Getting count, mean, and standard deviation of every variable
SummaryStats<-get_summary_stats(CleanData,type="mean_sd")
SummaryStats<-rename(SummaryStats,
                     Variable='variable',
                     Count='n',
                     Mean='mean',
                     'Standard Deviation'='sd')
knitr::kable(SummaryStats)

Variable	Count	Mean	Standard Deviation
Alcohol_Fatalities	51	209.569	265.160
Auto_Premium	51	1131.772	240.974
Distracted_Fatalities	51	52.765	60.824
Drivers	51	1017.745	1146.004
Fatalities	51	721.471	806.329
Fatalities_Per_Driver	51	16.921	6.028
Fatality_Rate	51	1.139	0.277
Insurance_Losses	51	231.854	46.297
Licensed_Drivers	51	4461929.118	4889545.162
Speeding_Fatalities	51	185.314	202.484
Traffic_Congestion	51	0.854	0.579

Analysis

Ranking

In order to figure out which state has the worst drivers; I created a ranking system which ranks each state and region based on various categories.

Loss_Rank
- 1- Highest Number of Losses Incurred by Insurance Companies for Collisions per Insured Driver
- Losses from collisions is a good measure of the number of accidents drivers are involved in by state. These losses allow us to look at non-fatal accidents and are free of bias due to individual judicial systems because they exclude liability premiums.
Premium_Rank
- 1- Highest Auto Insurance Premium Price
- Auto Insurance Premiums differ greatly by state because each state has different driving conditions. Insurance providers base their prices on a various number of indicators, including driver behavior in accidents that were not fatal. This means that the average premiums in each state could reflect insurance companies’ overall assessment of who is most likely to cost them in the future
Traffic_Rank
- 1- Lowest Number of Miles Driven per Mile of Roadway
- It is important to consider the different levels of road congestion because the more cars there are on the road, the more difficult it is to drive. Thus, increasing the chance that a collision will occur. This ranking is done in ascending order because if 2 regions/states have a similar number of fatal collisions, however one has less road congestion, then that state/region should be considered to have worse drivers.
Fatalities_Per_Driver_Rank
- 1- Greatest Ratio of Fatalities per 100,000 Licensed Drivers
- Fatal Accidents are the worst possible outcome from driving. By looking at fatalities from collisions per 100,000 drivers it can be determined which states have the highest fatality rates by adjusting to the number of licensed drivers in each state.
Fatalities_Per_Miles_Traveled_Rank
- 1- Greatest Ratio of Fatalities per Million Vehicle Miles Traveled
- Fatalities from collisions per miles traveled is a very significant variable because in some states, people drive a greater distance on average than others. In many western states, cities are more spread out and further away from each other, meaning residents living there have a greater distance to travel by car. However, in major cities and states with greater populations, there is less need to travel large distances because everything is close.

Function created for ranking categories:


```r
Ranks<-function(x,HighToLow=TRUE){
  a<-rank(x,ties.method="min")
  n<-length(x)+1
  #Ascending Order
  if(!HighToLow){
    return(a)
  #Descending Order
  }else{
    b<-n-a
    return(b)
  }
}
```

Region Rankings

Ranking Variables of Interest

#Creating subset of variables of interest
RegionData<-aggregate.data.frame(CleanData[,c(3,8,9,11,12)],list(CleanData$Region),mean)
Region<-RegionData$Group.1
#Ranking each category
Loss_Rank <- Ranks(x=RegionData$Insurance_Losses,HighToLow=TRUE)
Premium_Rank<-Ranks(x=RegionData$Auto_Premium,HighToLow=TRUE)
Road_Congestion_Rank<-Ranks(x=RegionData$Traffic_Congestion,HighToLow=FALSE)
Fatalities_Per_Driver_Rank<-Ranks(x=RegionData$Fatalities_Per_Driver,HighToLow=TRUE)
Fatalities_Per_Miles_Traveled_Rank<-Ranks(x=RegionData$Fatality_Rate, HighToLow=TRUE)
#Creating data frame of ranks
RegionRank<-data.frame(Region,Loss_Rank,Premium_Rank,Road_Congestion_Rank,Fatalities_Per_Driver_Rank,Fatalities_Per_Miles_Traveled_Rank)
datatable(RegionRank,rownames=FALSE)

Final Rankings

Weighted Rank Formula: \[ \tiny .2*LossRank+.2*PremiumRank+.1*RoadCongestionRank+.3*FatalitiesPerDriverRank+.2*FatalitiesPerMilesTraveledRank \]

#Computing overall rank
WeightedRank<-.2*Loss_Rank+.2*Premium_Rank+.1*Road_Congestion_Rank+.3*Fatalities_Per_Driver_Rank+.2*Fatalities_Per_Miles_Traveled_Rank
#Final Ranking
FinalRegionRank<-Ranks(WeightedRank, HighToLow=FALSE)
FinalRegionRankTable<-data.frame(Region,WeightedRank,FinalRegionRank)
datatable(FinalRegionRankTable,rownames=FALSE)

The rank system establishes that the region with the worst drivers is the West South Central region with an average weighted rank of 2.4. This region includes Texas, Oklahoma, Arkansas, and Louisiana.

State Rankings

Ranking Variables of Interest

#Ranking each category
Loss_Rank<- Ranks(x=CleanData$Insurance_Losses,HighToLow=TRUE)
Premium_Rank<-Ranks(x=CleanData$Auto_Premium,HighToLow=TRUE)
Road_Congestion_Rank<-Ranks(x=CleanData$Traffic_Congestion,HighToLow=FALSE)
Fatalities_Per_Driver_Rank<-Ranks(x=CleanData$Fatalities_Per_Driver,HighToLow=TRUE)
Fatalities_Per_Miles_Traveled_Rank<-Ranks(x=CleanData$Fatality_Rate, HighToLow=TRUE)
State<-CleanData$State
#Creating data frame of ranks
StateRank<-data.frame(State,Loss_Rank,Premium_Rank,Road_Congestion_Rank,Fatalities_Per_Driver_Rank,Fatalities_Per_Miles_Traveled_Rank)
datatable(StateRank)

Final Rankings

Weighted Rank Formula: \[ \tiny .2*LossRank+.2*PremiumRank+.1*RoadCongestionRank+.3*FatalitiesPerDriverRank+.2*FatalitiesPerMilesTraveledRank \]

#Computing overall rank
WeightedStateRank<-+.2*Loss_Rank+.2*Premium_Rank+.1*Road_Congestion_Rank+.3*Fatalities_Per_Driver_Rank+.2*Fatalities_Per_Miles_Traveled_Rank
#Final Ranking
FinalStateRank<-Ranks(WeightedStateRank, HighToLow=FALSE)
FinalStateRankTable<-data.frame(State,WeightedStateRank,FinalStateRank)
datatable(FinalStateRankTable,rownames=FALSE)

From the ranking system, it can be seen that Louisiana is the state with the worst drivers. Louisiana has a weighted rank of 7.8 and is ranked #1 in Auto Insurance Premium, #3 in losses incurred by insurance companies, and #4 in fatalities per million miles traveled.

ANOVA

Analysis of Variance (ANOVA) is a statistical method which analyzes variance to determine if the means from multiple populations are the same. If the average variation between groups is large in comparison to the other groups, it can be concluded that at least one group’s mean is significantly different than the others. This test was chosen because the data that is being analyzed has a categorical dependent variable with various quantitative independent variables combined into one weighted rank.

Hypothesis and Assumptions

H0 : all means are equal

HA : at least one mean is different

Before an ANOVA test can be performed, the following assumptions need to be made:

Errors are normally distributed.
Errors have constant variance.
Errors have a mean of O.
Errors are independent.

Checking Assumptions

Summary Statistics

#Subset of data used in ANOVA
RegionData<- data.table("Region"=CleanData$Region,"State"=CleanData$State, "WeightedRank"=WeightedStateRank)
AssumpData <-RegionData
#Grouping by region and getting summary stats
GroupedData <-group_by(AssumpData, Region)
Summary<-get_summary_stats(GroupedData,WeightedRank,type="mean_sd")
#Displaying summary stats in table and graph
Summary<-as.data.frame(Summary)
datatable(Summary)

ggplot(AssumpData, aes(y=WeightedRank, x=Region))+coord_flip()+geom_boxplot()

From the graph above, it is clear to see that even before the ANOVA test is performed, the West South Central and East South Central regions seem to have a higher weighted rank. While the New England, East North Central, and West North Central regions seem to have a lower weighted rank. This helps give a preview into what to expect in the ANOVA results.

Normality

Analyzing the ANOVA model residuals using QQ plot and Shapiro-Wilk test of normality

# Linear Model
model <-lm(WeightedRank~Region, data=AssumpData)
# QQ plot of residuals
ggqqplot(residuals(model))

knitr::kable(shapiro_test(residuals(model)),caption="Shapiro-Wilk Test")

Shapiro-Wilk Test
variable	statistic	p.value
residuals(model)	0.9719217	0.2653565

In the QQ plot, all the points fall relatively along the reference line. From this, it can concluded there is no enough evidence to declare non-normality. The results of the Shapiro-Wilk test also support this conclusion. The p-value is not significant for any reasonable alpha which confirms there is not enough evidence to suggest the errors are not normally distributed.

Constant Variance

Constant Variance is analyzed by looking at the residuals versus fits plot and Levene’s test

#Residual plot
plot(model,1)

#Levene's Test
knitr::kable(levene_test(AssumpData, WeightedRank~Region),caption="Levene's Test")

Levene’s Test
df1	df2	statistic	p
8	42	0.8147201	0.5937948

By looking at the residual plot, it can be seen that there is no clear relationship between residuals and fitted values(mean of each group). Therefore, it can be assumed that the errors have a constant variance. This result is confirmed by Levene’s test, the p-value is greater than any reasonably alpha level meaning there is not sufficient evidence to suggest there are differences in the variances among the groups.

Mean of 0

There is no way to check, so it is assumed to be true.

Independence

Each state belongs to one individual region. Therefore, there is no repetition, thus no relationship between the different regions making them independent. The errors can be assumed to be independent as well.

Final ANOVA

ANOVA Results

# ANOVA Model
aov.p<- RegionData %>% anova_test(WeightedRank~Region)
knitr::kable(aov.p,caption="ANOVA Table")

ANOVA Table
Effect	DFn	DFd	F	p	p<.05	ges
Region	8	42	6.278	2.46e-05	*	0.545

From the ANOVA table, it can be seen that there is a significant difference between group means given p=2.46e-05 is less than alpha=.05. Therefore, the null hypothesis of equal means is rejected. The ges column is the generalized eta squared. This is a proportion that measures the variability of the outcome variable in terms of the predictor or group. In terms of the ANOVA preformed above, the ges=.545, meanings that 54.5% of the variation in WeightedRank can be accounted for by the Region.

Visualizations

In addition to the ranking system and ANOVA test, creating visualizations allows trends and patterns in the data to be more easily seen.

Region

By looking at the key variables in bar graphs, it emphasizes that the West South Central region has the worst drivers.

Insurance Losses

#Average Insurance Loss data by region
RegionMap<-aggregate.data.frame(CleanData$Insurance_Losses,list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Loss=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Loss)) +
  geom_bar(stat = "identity",fill="skyblue3") +
  xlab("Region")+
  ylab("Annual Losses per Driver")+
  coord_flip() +
  theme_minimal()+
  ggtitle("Losses by Insurance Companies for Collisions per Insured Driver by Region")

In the figure above, it can be seen that the West South Central and Middle Atlantic regions have the highest amount of losses incurred by insurance companies while the West North Central region has the least.

Auto Premium

#Average Car Premium data by region
RegionMap<-aggregate.data.frame(CleanData[,8],list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Premium=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Premium)) +
  geom_bar(stat = "identity",fill="seagreen") +
  xlab("Region")+
  ylab("Average Auto Insurance Premium")+
  coord_flip() +
  theme_minimal()+
  ggtitle("Average Annual Auto Insurance Premium by Region")

In the figure above, it can be seen that the West South Central, South Atlantic, and Middle Atlantic regions have the highest average auto insurance premium.

Road Congestion

#Average Road Congestion data by region
RegionMap<-aggregate.data.frame(CleanData$Traffic_Congestion,list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Congestion=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Congestion)) +
  geom_bar(stat = "identity",fill="mediumorchid3") +
  xlab("Region")+
  ylab("Millions of Miles Driven per Mile of Roadway")+
  coord_flip() +
  theme_minimal()+
  ggtitle("Millions of Miles Driven per Mile of Roadway by Region")

In the figure above, it can be seen that the South Atlantic, Pacific, and New England regions have the greatest number of miles driver per mile of roadway. This makes sense as the South Atlantic region contains Florida, the Pacific region contains California, and the New England region contains New York.

Fatalities per Licensed Drivers

#Average Fatalities per Licensed Driver data by region
RegionMap<-aggregate.data.frame(CleanData[,12],list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Fatalities=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Fatalities)) +
  geom_bar(stat = "identity",fill="indianred3") +
  xlab("Region")+
  ylab("Fatalities per Licensed Driver")+
  coord_flip() +
  theme_minimal()+
  ggtitle("Fatalities per 100,000 Licensed Drivers by Region")

In the figure above, it can be seen that the West South Central and East South Central regions have the greatest rate of fatalities per 100,000 licensed drivers. It can also be seen than the New England and Middle Atlantic have a smaller rate.

Fatalities per Million VMT

#Average Fatalities per Million VMT data by region
RegionMap<-aggregate.data.frame(CleanData[,3],list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Fatalities=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Fatalities)) +
  geom_bar(stat = "identity",fill="royalblue3") +
  xlab("Region")+
  ylab("Fatalities per Million Vehicle Miles Traveled")+
  coord_flip() +
  theme_minimal()+
  ggtitle("Fatalities per Million Vehicle Miles Traveled by Region")

In the figure above, it can be seen that the West South Central and East South Central regions have the highest rate of fatalities per million vehicle miles traveled.

State

Graphs

Fatalities per Licensed Drivers is one of the most important variables in this analysis, accounting for 30% of the weighted rank. This visualization shows the breakdown of this ratio by state, while being filled by region.

#graph fatalities per 100,000 licensed drivers by state filled by region
ggplot(CleanData, aes(x=reorder(State,Fatalities_Per_Driver), y = Fatalities_Per_Driver, fill = Region)) +
  geom_bar(stat = "identity") +
  xlab("State")+
  ylab("Fatalities per 100,000 Licensed Drivers")+
  coord_flip() +
  theme_minimal()+
  ggtitle("Fatalities per 100,000 Licensed Drivers by State")+
  scale_fill_brewer(palette="Spectral")

Driver Actions

Many fatal crashes are caused by specific driver actions, some of the most common are speeding, drinking while driving, and distracted driving. All of these actions contribute to the overall quality of the driver. By looking at these actions by state it shows how many fatal crashes were caused by these bad actions. This allows for a better understanding of the drivers in each state.

Speeding

In this situation, speeding is defined as higher than the posted speed limit. Alaska has the highest percentage of drivers who were speeding in fatal collisions. Both Mississippi and Florida have a relatively low percentage.

#Create new data frame with specific driving condition
Graph2<-cbind("Drivers Who Were Not Speeding in Fatal Collisions" =CleanData$Drivers-CleanData$Speeding_Fatalities,
              "The Drivers Who Were Speeding in Fatal Collisions"=CleanData$Speeding_Fatalities)
library(reshape)
NewGraph2<-melt(Graph2)
NewGraph2<-cbind(NewGraph2, "State"=rep(CleanData$State,2))
names(NewGraph2)<-c("Rep", "Catergory","Value","State")
#Graph percentage bar plot
ggplot(NewGraph2, aes(fill=Catergory, y=Value, x=State)) + 
    ylab("Fatalities")+
    coord_flip()+
    scale_fill_brewer(palette="RdPu")+
    geom_bar(position="Fill", stat="identity")+
    theme_minimal()+
    ggtitle("Percentage of Drivers Who Were Speeding in Fatal Collisions")

Drinking

In this situation, drinking is defined as having a blood alcohol concentration over the legal limit (.08). Montana has the highest percentage of drivers who were drinking in fatal collisions. This number is extremely high compared to other states at 43%.

#Create new data frame with specific driving condition
Graph4<-cbind("Drivers Who were Not Drinking in Fatal Collisions"
              =CleanData$Drivers-CleanData$Alcohol_Fatalities,
              "The Drivers Who were Drinking in Fatal Collisions"=CleanData$Alcohol_Fatalities)
NewGraph4<-melt(Graph4)
NewGraph4<-cbind(NewGraph4, "State"=rep(CleanData$State,2))
names(NewGraph4)<-c("Rep", "Catergory","Value","State")
#Graph percentage bar plot
ggplot(NewGraph4, aes(fill=Catergory, y=Value, x=State)) + 
    ylab("Fatalities")+
    coord_flip()+
    scale_fill_brewer(palette="Blues")+
    geom_bar(position="fill", stat="identity")+
    theme_minimal()+
    ggtitle("Percentage of Drivers Who Were Drinking in Fatal Collisions")

Distracted

In this situation, distracted is defined as anything taking attention away from the road (phone, talking, eating, etc.). New Mexico has the highest percentage of distracted drivers involved in fatal collision, and is closely followed by Kansas.

#Create new data frame with specific driving condition
Graph5<-cbind("Drivers Who Were Not Distracted in Fatal Collisions"=CleanData$Drivers-CleanData$Distracted_Fatalities,
              "The Drivers Who Were Distracted in Fatal Collisions"=CleanData$Distracted_Fatalities)
NewGraph5<-melt(Graph5)
NewGraph5<-cbind(NewGraph5, "State"=rep(CleanData$State,2))
names(NewGraph5)<-c("Rep", "Catergory","Value","State")
#Graph percentage bar plot
ggplot(NewGraph5, aes(fill=Catergory, y=Value, x=State)) + 
    ylab("Fatalities")+
    coord_flip()+
    scale_fill_brewer(palette="Oranges")+
    geom_bar(position="fill", stat="identity")+
    theme_minimal()+
    ggtitle("Percentage of Drivers Who Were Distracted in Fatal Collisions")

Maps

Insurance Losses

#Import libraries used to create maps
library(ggthemes)
library(usmap)
CleanData$state<-DriverData$State
#Creates Insurance Loss Map
plot_usmap(data= CleanData, values = "Insurance_Losses") +
  scale_fill_gradient(low="white",high="skyblue4") +
  labs(title = "Annual Losses by Insurance Companies for Collisions per Insured Driver") +
  theme_map() + labs(fill = "Dollars")+
  theme(legend.position = "right")

Average Auto Premium

#Create Auto Premium Map
plot_usmap(data= CleanData, values = "Auto_Premium")+
  scale_fill_gradient(low="white",high="seagreen") +
  labs(title = "Average Annual Auto Insurance Premium") +
  theme_map() + labs(fill = "Dollars")+
  theme(legend.position = "right")

Road Congestion

#Create Traffic Congestion Map
plot_usmap(data= CleanData, values = "Traffic_Congestion")+
  scale_fill_gradient(low="white",high="mediumorchid4") +
  labs(title = "Millions of Miles Driven per Mile of Roadway") +
  theme_map() + labs(fill = "Millions of Miles Driven per Mile of Roadway")+
  theme(legend.position = "right")

Fatalities per Licensed Drivers

#Create Fatalities per Licensed Driver Map
plot_usmap(data= CleanData, values = "Fatalities_Per_Driver")+
  scale_fill_gradient(low="white",high="indianred3") +
  labs(title = "Fatalities per 100,000 Licensed Drivers") +
  theme_map() + labs(fill = "Fatalities per 100,000 Licensed Drivers")+
  theme(legend.position = "right")

Fatalities per Million VMT

#Create Fatalities per Million VMT Map
plot_usmap(data= CleanData, values = "Fatality_Rate")+
  scale_fill_gradient(low="white",high="royalblue4") +
  labs(title = "Fatality Rate per Million VMT") +
  theme_map() + labs(fill = "Fatality Rate per Million VMT")+
  theme(legend.position = "right")

Conclusion

Findings

By analyzing the key variables, I determined which region and state has the worst drivers. It was found that the West South Central region, which contains Arkansas, Louisiana, Oklahoma, and Texas have the highest likelihood of encountering poor driving. This conclusion is supported by both the ranking system and numerous visualizations. Additionally, it was found that Louisiana, which is part of this region, stands out as the state with a greatest portion of bad drivers. The results from the ANOVA test confirm that there is a significant difference in the rankings of the regions.

Continued Research

Research about this topic should always been continued, especially as newer data is released. It would be interesting to include data about all collisions(not just fatal collisions) and other objective qualities of driving ability such as number traffic violations(speeding tickets, running red lights, etc.) or information from phone apps which monitor driving such like Life360.

Limitations

As with every project, this analysis does have limitations. The biggest concern is that these finding are based on data from 2017 or 2018. It would be better to have more recent data as it become available. Another limitation is that it is difficult to determine if these findings are statistically significant at a state level. The ranking system gives a good insight into the states and regions with both good and bad drivers but it was not statistically confirmed that these rankings are significant.

References

National Highway Traffic Safety Administration

National Association of Insurance Commissioners

AAA Foundation

US Census Bureau

About Me!

#Include picture
knitr::include_graphics('LanaWilliams.png')

Hello! My name is Lana Williams and I am a 2nd year undergraduate student at Clemson University, originally from Bloomington, Illinois. I am majoring in mathematical sciences with an emphasis in statistics and a minor in computer science. In my free time, I enjoy paddle boarding, solving my Rubix’s cube, and hiking with my dog Lily.

STAT 4020 Final Project

Lana Williams

Abstract

Introduction

Which State Has the Worst Drivers?

Analyzing the Problem

Key Variables

Previous Research on This Topic

Data

Packages

Cleaning

Clean Data

Estimating Missing Values

Rhode Island

Texas

Regions

Description of Data

Obtaining Data

Variables

Interactive Datatable

Summary Statistics

Analysis

Ranking

Region Rankings

Ranking Variables of Interest

Final Rankings

State Rankings

Ranking Variables of Interest

Final Rankings

ANOVA

Hypothesis and Assumptions

Checking Assumptions

Summary Statistics

Normality

Constant Variance

Mean of 0

Independence

Final ANOVA

ANOVA Results

Visualizations

Region

Insurance Losses

Auto Premium

Road Congestion

Fatalities per Licensed Drivers

Fatalities per Million VMT

State

Graphs

Driver Actions

Speeding

Drinking

Distracted

Maps

Insurance Losses

Average Auto Premium

Road Congestion

Fatalities per Licensed Drivers

Fatalities per Million VMT

Conclusion

Findings

Continued Research

Limitations

References

About Me!