Every state has bad drivers, but some states seem to have a greater amount than others. The purpose of this study is to determine which region and state of the United States has the worst drivers. In order to determine this, a ranking system was created that ranked 5 different key variables: Losses Incurred by Insurance Companies, Average Auto Insurance Premium, Road Congestion, Fatalities per Licensed Driver, and Fatalities per Million Vehicle Miles Traveled. These key variables were taken from published reports by The National Highway Traffic Safety Administration (NHTSA) and The National Association of Insurance Commissioners (NAIC). In addition, an Analysis of Variance (ANOVA) test was performed that showed that there is a significant difference in the rankings by region. After the analysis was performed, it was determined that the West South Central region is the region with the worst drivers and Louisiana is the state with the worst drivers. These conclusions can also be found by looking at the data and numerous visualizations throughout this report.
In a survey done by the AAA Foundation, it was found that in 2016-2017, Americans spent on average, 51 minutes in the car every day. This totals to 310 hours behind the wheel annually. With all this time in the car, a motorist will likely experience a bad driver. Whether it is the congested interstates of California, the substantial aging population in Florida, or the aggressive drivers of New York, most Americans claim that their own state has the worst drivers. However, while every state has bad drivers, some have more in comparison to others.
This report looks at a variety of data sources to determine which state really has the worst drivers. To do this, I created a ranking system that ranks each of the key variables. After each category is ranked, the total rank is computed and the state with the highest rank indicating the worst drivers. I looked at this problem from both a region and a state prospective and ranked each using the ranking system. In addition, an ANOVA test was completed using the final weighted rank by region in order to determine if there was a significant difference in the rankings by region. Aside from the rankings and ANOVA, I also created various visualizations which help better understand some of the reasons behind these fatal crashes and the key variables.
In 2014, FiveThirtyEight released an article that analyzed which state had the worst drivers. The data used in the analysis is from 2009-2012. Instead of doing my own analysis using this data, I choose to find and analyze more recent data from 2017-2018. Finding new data allows for the results to be more relevant to today’s society.
The following R libraries are required in order to run the code used in this project. In addition, all of the code for this project featured throughout can be viewed using the “code” tab.
#Creating Package Table
Packages <-data.table(Package=c("rmdformats","DT","readxl","tidyverse","data.table","RColorBrewer","dplyr","rstatix","ggpubr","reshape","usmaps","ggtheme"),
Use=c("R Markdown/HTML Template","Displaying Data Tables","Reading Excel Data into R","Data Manipulation","Displaying Data Tables",
"Color Palettes for Graphs","Data Manipulation","Statistical Tests","Statistical Tests","Data Manipluation","Template for creating population maps",
"Themes for graphs and maps"))
knitr::kable(Packages)| Package | Use |
|---|---|
| rmdformats | R Markdown/HTML Template |
| DT | Displaying Data Tables |
| readxl | Reading Excel Data into R |
| tidyverse | Data Manipulation |
| data.table | Displaying Data Tables |
| RColorBrewer | Color Palettes for Graphs |
| dplyr | Data Manipulation |
| rstatix | Statistical Tests |
| ggpubr | Statistical Tests |
| reshape | Data Manipluation |
| usmaps | Template for creating population maps |
| ggtheme | Themes for graphs and maps |
In order to prepare for analysis, the following actions were performed on the data:
#Importing Data
DriverData <- read_xlsx("Project Data.xlsx",col_names=TRUE,na="NA")
#Renaming columns for easier use
CleanData<-
rename(DriverData,
Fatality_Rate= `Fatality Rate per Millon VMT`,
Drivers= `Drivers Involved in Fatal Crashes`,
Speeding_Fatalities =`Speeding-Related Fatalities`,
Alcohol_Fatalities =`Alcohol-Impaired Driving Fatalities (BAC=.08+)`,
Distracted_Fatalities =`Distracted Driving Fatalities`,
Auto_Premium =`Average Car Insurance Premium`,
Insurance_Losses =`Losses incurred by insurance companies for collisions per insured driver`,
Licensed_Drivers=`Licensed Drivers`,
Traffic_Congestion=`Millions of Miles Driven per Mile of Roadway`
)
#Adding Fatalities_Per_Driver column by dividing Fatalities by (Licensed_Drivers/100000)
CleanData <- transform(CleanData, Fatalities_Per_Driver= c(Fatalities/(Licensed_Drivers/100000)))
#Rounding Fatalities_Per_Driver to 6 places
CleanData <- mutate_at(CleanData,vars(Fatalities_Per_Driver), funs(round(., 6)))There are 2 missing values in the data
For better analysis, the values were filled in using an average common ratio between variables. This was the best method given the state’s total population. With Rhode Island being one of the smallest states and Texas being one of the largest states, using just the mean of each category for the missing value would have been an inadequate solution.
mean_Fatalities <- mean(CleanData$Drivers)
mean_distracted <- mean(CleanData$Distracted_Fatalities, na.rm=TRUE)
#Average ratio
Ratio1 <- mean_distracted/mean_Fatalities
#Computing missing value
RhodeDistracted<- CleanData$Fatalities[40]*Ratio1
RhodeDistracted<- round(RhodeDistracted,digits=0)
#Filling missing value
CleanData$Distracted_Fatalities[40]<-RhodeDistractedCommon Ratio between distracted drivers in fatal collisions and total drivers in fatal collisions: 0.0528227
Missing Value: 3
mean_Premium <- mean(CleanData$Auto_Premium)
mean_Loss <- mean(CleanData$Insurance_Losses, na.rm=TRUE)
#Average ratio
Ratio2 <- mean_Loss/mean_Premium
#Computing missing value
TexasLoss<- CleanData$Auto_Premium[44]*Ratio2
TexasLoss<- round(TexasLoss,digits=2)
#Filling missing value
CleanData$Insurance_Losses[44]<-TexasLossCommon Ratio between Insurance Loss and Auto Premium: 0.2040097
Missing Value: 279.95
# Splitting all the states into regions
CleanData$Region <- ifelse(CleanData$State %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"), "East North Central",
ifelse(CleanData$State %in% c("Alabama", "Kentucky", "Mississippi", "Tennessee"), "East South Central",
ifelse(CleanData$State %in% c("New Jersey", "New York", "Pennsylvania"), "Middle Atlantic",
ifelse(CleanData$State %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming"), "Mountain",
ifelse(CleanData$State %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"), "New England",
ifelse(CleanData$State %in% c("Alaska", "California", "Hawaii", "Oregon", "Washington"), "Pacific",
ifelse(CleanData$State %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "Dist of Columbia", "West Virginia"), "South Atlantic",
ifelse(CleanData$State %in% c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"), "West North Central",
ifelse(CleanData$State %in% c("Arkansas", "Louisiana", "Oklahoma", "Texas"), "West South Central", "NA")))))))))Most of the data analyzed in the project was originally published by The National Highway Traffic Safety Administration or The National Association of Insurance Commissioners.
All data is annual.
#Creating table to display description of variables and their source
DesTable <- data.table(Variables=colnames(CleanData),
Description=c('50 US States and District of Columbia', 'Fatalities From Fatal Collisions',
'Fatalities per Million Vehicle Miles Traveled',
'Drivers Involved in Fatal Collisions',
'Drivers Involved in Fatal Collisions Who Were Speeding',
'Drivers Involved in Fatal Collisions Who Were Alcohol-Impaired',
'Drivers Involved in Fatal Collisions Who Were Distracted',
'Average Auto Insurance Premium',
'Losses Incurred by Insurance Companies for Collisions per Insured Driver',
'Number of Licensed Drivers',
'Millions of Miles Driven per Mile of Roadway',
'Fatalities per 100,000 Licensed Drivers',
'Region of the US'),
Source=c('N/A','National Highway Traffic Safety Administration, 2018',
'National Highway Traffic Safety Administration, 2018',
'National Highway Traffic Safety Administration, 2018',
'National Highway Traffic Safety Administration, 2018',
'National Highway Traffic Safety Administration, 2018',
'National Highway Traffic Safety Administration, 2018',
'National Association of Insurance Commissioners, 2018',
'National Association of Insurance Commissioners, 2017',
'National Highway Traffic Safety Administration, 2018',
'National Association of Insurance Commissioners, 2017',
'Calculated','U.S. Census Bureau'))
#Displaying Table
knitr::kable(DesTable)| Variables | Description | Source |
|---|---|---|
| State | 50 US States and District of Columbia | N/A |
| Fatalities | Fatalities From Fatal Collisions | National Highway Traffic Safety Administration, 2018 |
| Fatality_Rate | Fatalities per Million Vehicle Miles Traveled | National Highway Traffic Safety Administration, 2018 |
| Drivers | Drivers Involved in Fatal Collisions | National Highway Traffic Safety Administration, 2018 |
| Speeding_Fatalities | Drivers Involved in Fatal Collisions Who Were Speeding | National Highway Traffic Safety Administration, 2018 |
| Alcohol_Fatalities | Drivers Involved in Fatal Collisions Who Were Alcohol-Impaired | National Highway Traffic Safety Administration, 2018 |
| Distracted_Fatalities | Drivers Involved in Fatal Collisions Who Were Distracted | National Highway Traffic Safety Administration, 2018 |
| Auto_Premium | Average Auto Insurance Premium | National Association of Insurance Commissioners, 2018 |
| Insurance_Losses | Losses Incurred by Insurance Companies for Collisions per Insured Driver | National Association of Insurance Commissioners, 2017 |
| Licensed_Drivers | Number of Licensed Drivers | National Highway Traffic Safety Administration, 2018 |
| Traffic_Congestion | Millions of Miles Driven per Mile of Roadway | National Association of Insurance Commissioners, 2017 |
| Fatalities_Per_Driver | Fatalities per 100,000 Licensed Drivers | Calculated |
| Region | Region of the US | U.S. Census Bureau |
#Getting count, mean, and standard deviation of every variable
SummaryStats<-get_summary_stats(CleanData,type="mean_sd")
SummaryStats<-rename(SummaryStats,
Variable='variable',
Count='n',
Mean='mean',
'Standard Deviation'='sd')
knitr::kable(SummaryStats)| Variable | Count | Mean | Standard Deviation |
|---|---|---|---|
| Alcohol_Fatalities | 51 | 209.569 | 265.160 |
| Auto_Premium | 51 | 1131.772 | 240.974 |
| Distracted_Fatalities | 51 | 52.765 | 60.824 |
| Drivers | 51 | 1017.745 | 1146.004 |
| Fatalities | 51 | 721.471 | 806.329 |
| Fatalities_Per_Driver | 51 | 16.921 | 6.028 |
| Fatality_Rate | 51 | 1.139 | 0.277 |
| Insurance_Losses | 51 | 231.854 | 46.297 |
| Licensed_Drivers | 51 | 4461929.118 | 4889545.162 |
| Speeding_Fatalities | 51 | 185.314 | 202.484 |
| Traffic_Congestion | 51 | 0.854 | 0.579 |
In order to figure out which state has the worst drivers; I created a ranking system which ranks each state and region based on various categories.
Function created for ranking categories:
```r
Ranks<-function(x,HighToLow=TRUE){
a<-rank(x,ties.method="min")
n<-length(x)+1
#Ascending Order
if(!HighToLow){
return(a)
#Descending Order
}else{
b<-n-a
return(b)
}
}
```
#Creating subset of variables of interest
RegionData<-aggregate.data.frame(CleanData[,c(3,8,9,11,12)],list(CleanData$Region),mean)
Region<-RegionData$Group.1
#Ranking each category
Loss_Rank <- Ranks(x=RegionData$Insurance_Losses,HighToLow=TRUE)
Premium_Rank<-Ranks(x=RegionData$Auto_Premium,HighToLow=TRUE)
Road_Congestion_Rank<-Ranks(x=RegionData$Traffic_Congestion,HighToLow=FALSE)
Fatalities_Per_Driver_Rank<-Ranks(x=RegionData$Fatalities_Per_Driver,HighToLow=TRUE)
Fatalities_Per_Miles_Traveled_Rank<-Ranks(x=RegionData$Fatality_Rate, HighToLow=TRUE)
#Creating data frame of ranks
RegionRank<-data.frame(Region,Loss_Rank,Premium_Rank,Road_Congestion_Rank,Fatalities_Per_Driver_Rank,Fatalities_Per_Miles_Traveled_Rank)
datatable(RegionRank,rownames=FALSE)Weighted Rank Formula: \[ \tiny .2*LossRank+.2*PremiumRank+.1*RoadCongestionRank+.3*FatalitiesPerDriverRank+.2*FatalitiesPerMilesTraveledRank \]
#Computing overall rank
WeightedRank<-.2*Loss_Rank+.2*Premium_Rank+.1*Road_Congestion_Rank+.3*Fatalities_Per_Driver_Rank+.2*Fatalities_Per_Miles_Traveled_Rank
#Final Ranking
FinalRegionRank<-Ranks(WeightedRank, HighToLow=FALSE)
FinalRegionRankTable<-data.frame(Region,WeightedRank,FinalRegionRank)
datatable(FinalRegionRankTable,rownames=FALSE)The rank system establishes that the region with the worst drivers is the West South Central region with an average weighted rank of 2.4. This region includes Texas, Oklahoma, Arkansas, and Louisiana.
#Ranking each category
Loss_Rank<- Ranks(x=CleanData$Insurance_Losses,HighToLow=TRUE)
Premium_Rank<-Ranks(x=CleanData$Auto_Premium,HighToLow=TRUE)
Road_Congestion_Rank<-Ranks(x=CleanData$Traffic_Congestion,HighToLow=FALSE)
Fatalities_Per_Driver_Rank<-Ranks(x=CleanData$Fatalities_Per_Driver,HighToLow=TRUE)
Fatalities_Per_Miles_Traveled_Rank<-Ranks(x=CleanData$Fatality_Rate, HighToLow=TRUE)
State<-CleanData$State
#Creating data frame of ranks
StateRank<-data.frame(State,Loss_Rank,Premium_Rank,Road_Congestion_Rank,Fatalities_Per_Driver_Rank,Fatalities_Per_Miles_Traveled_Rank)
datatable(StateRank)Weighted Rank Formula: \[ \tiny .2*LossRank+.2*PremiumRank+.1*RoadCongestionRank+.3*FatalitiesPerDriverRank+.2*FatalitiesPerMilesTraveledRank \]
#Computing overall rank
WeightedStateRank<-+.2*Loss_Rank+.2*Premium_Rank+.1*Road_Congestion_Rank+.3*Fatalities_Per_Driver_Rank+.2*Fatalities_Per_Miles_Traveled_Rank
#Final Ranking
FinalStateRank<-Ranks(WeightedStateRank, HighToLow=FALSE)
FinalStateRankTable<-data.frame(State,WeightedStateRank,FinalStateRank)
datatable(FinalStateRankTable,rownames=FALSE)From the ranking system, it can be seen that Louisiana is the state with the worst drivers. Louisiana has a weighted rank of 7.8 and is ranked #1 in Auto Insurance Premium, #3 in losses incurred by insurance companies, and #4 in fatalities per million miles traveled.
Analysis of Variance (ANOVA) is a statistical method which analyzes variance to determine if the means from multiple populations are the same. If the average variation between groups is large in comparison to the other groups, it can be concluded that at least one group’s mean is significantly different than the others. This test was chosen because the data that is being analyzed has a categorical dependent variable with various quantitative independent variables combined into one weighted rank.
H0 : all means are equal
HA : at least one mean is different
Before an ANOVA test can be performed, the following assumptions need to be made:
Errors are normally distributed.
Errors have constant variance.
Errors have a mean of O.
Errors are independent.
#Subset of data used in ANOVA
RegionData<- data.table("Region"=CleanData$Region,"State"=CleanData$State, "WeightedRank"=WeightedStateRank)
AssumpData <-RegionData
#Grouping by region and getting summary stats
GroupedData <-group_by(AssumpData, Region)
Summary<-get_summary_stats(GroupedData,WeightedRank,type="mean_sd")
#Displaying summary stats in table and graph
Summary<-as.data.frame(Summary)
datatable(Summary) From the graph above, it is clear to see that even before the ANOVA test is performed, the West South Central and East South Central regions seem to have a higher weighted rank. While the New England, East North Central, and West North Central regions seem to have a lower weighted rank. This helps give a preview into what to expect in the ANOVA results.
Analyzing the ANOVA model residuals using QQ plot and Shapiro-Wilk test of normality
# Linear Model
model <-lm(WeightedRank~Region, data=AssumpData)
# QQ plot of residuals
ggqqplot(residuals(model))| variable | statistic | p.value |
|---|---|---|
| residuals(model) | 0.9719217 | 0.2653565 |
In the QQ plot, all the points fall relatively along the reference line. From this, it can concluded there is no enough evidence to declare non-normality. The results of the Shapiro-Wilk test also support this conclusion. The p-value is not significant for any reasonable alpha which confirms there is not enough evidence to suggest the errors are not normally distributed.
Constant Variance is analyzed by looking at the residuals versus fits plot and Levene’s test
| df1 | df2 | statistic | p |
|---|---|---|---|
| 8 | 42 | 0.8147201 | 0.5937948 |
By looking at the residual plot, it can be seen that there is no clear relationship between residuals and fitted values(mean of each group). Therefore, it can be assumed that the errors have a constant variance. This result is confirmed by Levene’s test, the p-value is greater than any reasonably alpha level meaning there is not sufficient evidence to suggest there are differences in the variances among the groups.
There is no way to check, so it is assumed to be true.
Each state belongs to one individual region. Therefore, there is no repetition, thus no relationship between the different regions making them independent. The errors can be assumed to be independent as well.
# ANOVA Model
aov.p<- RegionData %>% anova_test(WeightedRank~Region)
knitr::kable(aov.p,caption="ANOVA Table")| Effect | DFn | DFd | F | p | p<.05 | ges |
|---|---|---|---|---|---|---|
| Region | 8 | 42 | 6.278 | 2.46e-05 | * | 0.545 |
From the ANOVA table, it can be seen that there is a significant difference between group means given p=2.46e-05 is less than alpha=.05. Therefore, the null hypothesis of equal means is rejected. The ges column is the generalized eta squared. This is a proportion that measures the variability of the outcome variable in terms of the predictor or group. In terms of the ANOVA preformed above, the ges=.545, meanings that 54.5% of the variation in WeightedRank can be accounted for by the Region.
In addition to the ranking system and ANOVA test, creating visualizations allows trends and patterns in the data to be more easily seen.
By looking at the key variables in bar graphs, it emphasizes that the West South Central region has the worst drivers.
#Average Insurance Loss data by region
RegionMap<-aggregate.data.frame(CleanData$Insurance_Losses,list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Loss=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Loss)) +
geom_bar(stat = "identity",fill="skyblue3") +
xlab("Region")+
ylab("Annual Losses per Driver")+
coord_flip() +
theme_minimal()+
ggtitle("Losses by Insurance Companies for Collisions per Insured Driver by Region") In the figure above, it can be seen that the West South Central and Middle Atlantic regions have the highest amount of losses incurred by insurance companies while the West North Central region has the least.
#Average Road Congestion data by region
RegionMap<-aggregate.data.frame(CleanData$Traffic_Congestion,list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Congestion=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Congestion)) +
geom_bar(stat = "identity",fill="mediumorchid3") +
xlab("Region")+
ylab("Millions of Miles Driven per Mile of Roadway")+
coord_flip() +
theme_minimal()+
ggtitle("Millions of Miles Driven per Mile of Roadway by Region") In the figure above, it can be seen that the South Atlantic, Pacific, and New England regions have the greatest number of miles driver per mile of roadway. This makes sense as the South Atlantic region contains Florida, the Pacific region contains California, and the New England region contains New York.
#Average Fatalities per Licensed Driver data by region
RegionMap<-aggregate.data.frame(CleanData[,12],list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Fatalities=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Fatalities)) +
geom_bar(stat = "identity",fill="indianred3") +
xlab("Region")+
ylab("Fatalities per Licensed Driver")+
coord_flip() +
theme_minimal()+
ggtitle("Fatalities per 100,000 Licensed Drivers by Region") In the figure above, it can be seen that the West South Central and East South Central regions have the greatest rate of fatalities per 100,000 licensed drivers. It can also be seen than the New England and Middle Atlantic have a smaller rate.
#Average Fatalities per Million VMT data by region
RegionMap<-aggregate.data.frame(CleanData[,3],list(CleanData$Region),mean)
RegionMap<- rename(RegionMap,Region=`Group.1`,Fatalities=`x`)
#Graph bar plot
ggplot(RegionMap, aes(x=Region, y = Fatalities)) +
geom_bar(stat = "identity",fill="royalblue3") +
xlab("Region")+
ylab("Fatalities per Million Vehicle Miles Traveled")+
coord_flip() +
theme_minimal()+
ggtitle("Fatalities per Million Vehicle Miles Traveled by Region") In the figure above, it can be seen that the West South Central and East South Central regions have the highest rate of fatalities per million vehicle miles traveled.
Fatalities per Licensed Drivers is one of the most important variables in this analysis, accounting for 30% of the weighted rank. This visualization shows the breakdown of this ratio by state, while being filled by region.
#graph fatalities per 100,000 licensed drivers by state filled by region
ggplot(CleanData, aes(x=reorder(State,Fatalities_Per_Driver), y = Fatalities_Per_Driver, fill = Region)) +
geom_bar(stat = "identity") +
xlab("State")+
ylab("Fatalities per 100,000 Licensed Drivers")+
coord_flip() +
theme_minimal()+
ggtitle("Fatalities per 100,000 Licensed Drivers by State")+
scale_fill_brewer(palette="Spectral")Many fatal crashes are caused by specific driver actions, some of the most common are speeding, drinking while driving, and distracted driving. All of these actions contribute to the overall quality of the driver. By looking at these actions by state it shows how many fatal crashes were caused by these bad actions. This allows for a better understanding of the drivers in each state.
In this situation, speeding is defined as higher than the posted speed limit. Alaska has the highest percentage of drivers who were speeding in fatal collisions. Both Mississippi and Florida have a relatively low percentage.
#Create new data frame with specific driving condition
Graph2<-cbind("Drivers Who Were Not Speeding in Fatal Collisions" =CleanData$Drivers-CleanData$Speeding_Fatalities,
"The Drivers Who Were Speeding in Fatal Collisions"=CleanData$Speeding_Fatalities)
library(reshape)
NewGraph2<-melt(Graph2)
NewGraph2<-cbind(NewGraph2, "State"=rep(CleanData$State,2))
names(NewGraph2)<-c("Rep", "Catergory","Value","State")
#Graph percentage bar plot
ggplot(NewGraph2, aes(fill=Catergory, y=Value, x=State)) +
ylab("Fatalities")+
coord_flip()+
scale_fill_brewer(palette="RdPu")+
geom_bar(position="Fill", stat="identity")+
theme_minimal()+
ggtitle("Percentage of Drivers Who Were Speeding in Fatal Collisions")In this situation, drinking is defined as having a blood alcohol concentration over the legal limit (.08). Montana has the highest percentage of drivers who were drinking in fatal collisions. This number is extremely high compared to other states at 43%.
#Create new data frame with specific driving condition
Graph4<-cbind("Drivers Who were Not Drinking in Fatal Collisions"
=CleanData$Drivers-CleanData$Alcohol_Fatalities,
"The Drivers Who were Drinking in Fatal Collisions"=CleanData$Alcohol_Fatalities)
NewGraph4<-melt(Graph4)
NewGraph4<-cbind(NewGraph4, "State"=rep(CleanData$State,2))
names(NewGraph4)<-c("Rep", "Catergory","Value","State")
#Graph percentage bar plot
ggplot(NewGraph4, aes(fill=Catergory, y=Value, x=State)) +
ylab("Fatalities")+
coord_flip()+
scale_fill_brewer(palette="Blues")+
geom_bar(position="fill", stat="identity")+
theme_minimal()+
ggtitle("Percentage of Drivers Who Were Drinking in Fatal Collisions")In this situation, distracted is defined as anything taking attention away from the road (phone, talking, eating, etc.). New Mexico has the highest percentage of distracted drivers involved in fatal collision, and is closely followed by Kansas.
#Create new data frame with specific driving condition
Graph5<-cbind("Drivers Who Were Not Distracted in Fatal Collisions"=CleanData$Drivers-CleanData$Distracted_Fatalities,
"The Drivers Who Were Distracted in Fatal Collisions"=CleanData$Distracted_Fatalities)
NewGraph5<-melt(Graph5)
NewGraph5<-cbind(NewGraph5, "State"=rep(CleanData$State,2))
names(NewGraph5)<-c("Rep", "Catergory","Value","State")
#Graph percentage bar plot
ggplot(NewGraph5, aes(fill=Catergory, y=Value, x=State)) +
ylab("Fatalities")+
coord_flip()+
scale_fill_brewer(palette="Oranges")+
geom_bar(position="fill", stat="identity")+
theme_minimal()+
ggtitle("Percentage of Drivers Who Were Distracted in Fatal Collisions")#Import libraries used to create maps
library(ggthemes)
library(usmap)
CleanData$state<-DriverData$State
#Creates Insurance Loss Map
plot_usmap(data= CleanData, values = "Insurance_Losses") +
scale_fill_gradient(low="white",high="skyblue4") +
labs(title = "Annual Losses by Insurance Companies for Collisions per Insured Driver") +
theme_map() + labs(fill = "Dollars")+
theme(legend.position = "right")#Create Traffic Congestion Map
plot_usmap(data= CleanData, values = "Traffic_Congestion")+
scale_fill_gradient(low="white",high="mediumorchid4") +
labs(title = "Millions of Miles Driven per Mile of Roadway") +
theme_map() + labs(fill = "Millions of Miles Driven per Mile of Roadway")+
theme(legend.position = "right")#Create Fatalities per Licensed Driver Map
plot_usmap(data= CleanData, values = "Fatalities_Per_Driver")+
scale_fill_gradient(low="white",high="indianred3") +
labs(title = "Fatalities per 100,000 Licensed Drivers") +
theme_map() + labs(fill = "Fatalities per 100,000 Licensed Drivers")+
theme(legend.position = "right")#Create Fatalities per Million VMT Map
plot_usmap(data= CleanData, values = "Fatality_Rate")+
scale_fill_gradient(low="white",high="royalblue4") +
labs(title = "Fatality Rate per Million VMT") +
theme_map() + labs(fill = "Fatality Rate per Million VMT")+
theme(legend.position = "right")By analyzing the key variables, I determined which region and state has the worst drivers. It was found that the West South Central region, which contains Arkansas, Louisiana, Oklahoma, and Texas have the highest likelihood of encountering poor driving. This conclusion is supported by both the ranking system and numerous visualizations. Additionally, it was found that Louisiana, which is part of this region, stands out as the state with a greatest portion of bad drivers. The results from the ANOVA test confirm that there is a significant difference in the rankings of the regions.
Research about this topic should always been continued, especially as newer data is released. It would be interesting to include data about all collisions(not just fatal collisions) and other objective qualities of driving ability such as number traffic violations(speeding tickets, running red lights, etc.) or information from phone apps which monitor driving such like Life360.
As with every project, this analysis does have limitations. The biggest concern is that these finding are based on data from 2017 or 2018. It would be better to have more recent data as it become available. Another limitation is that it is difficult to determine if these findings are statistically significant at a state level. The ranking system gives a good insight into the states and regions with both good and bad drivers but it was not statistically confirmed that these rankings are significant.
National Highway Traffic Safety Administration
Hello! My name is Lana Williams and I am a 2nd year undergraduate student at Clemson University, originally from Bloomington, Illinois. I am majoring in mathematical sciences with an emphasis in statistics and a minor in computer science. In my free time, I enjoy paddle boarding, solving my Rubix’s cube, and hiking with my dog Lily.