This project uses the dataset called, “US Accidents (3.5 million records) A Countrywide Traffic Accident Dataset (2016 - 2020),” to analyze particular factors that contribute to accidents that cause most interference in traffic. The dataset contain all sorts of information about each accident. There are a total of 2974335 observations with 49 variables for each observation in the dataset. Some of the variables are categorical. They mention if something was present at the time of the accident or where exactly the accident occurred and the time of day. Other variables are quantitative. They mention how much of something was present during the accident, such as precipitation. This study uses the ordinal variable, Severity, as a response variable. Severity assigns a value from 1 to 4 to describe how much impact the accident had on traffic with 4 being the most amount of interference and 1 being least amount of interference. This study is intended to use the other variables in this dataset to determine which of these variables contributes to some more traffic impacting accidents.
This code chunk loads each R package used in this project.
library(plyr)
library(maps)
library(lubridate)
#library(tidyverse)
library(caret)
library(zoo)
Lines 30-447 are used to visualize the data before making any models.
This code chunk imports a subset of the accidents data to be used for visualization purposes. This subset of a random sample of 5000 of the 2974335 obvervations in this data set.
#accidents = read.csv("C:/Users/Max Billante/Documents/Machine Learning/US Accidents Dec19.csv")
accidents = read.csv("C:/Users/Max Billante/Documents/Machine Learning/us accidents small.csv")
This code chunk makes a frequency table of the variable severity. From the table it is seen that most of the accidents in this data set have a severity factor of 2. A severity factor of 3 is neither common nor uncommon in this data set. There are few severity factors of 1 and 4 in this data set, but 4 is more common than 1.
accidents$Severity <- as.integer(accidents$Severity)
#count_severity <- count(accidents$Severity)
count_severity <- c(sum(accidents$Severity==1), sum(accidents$Severity==2), sum(accidents$Severity==3), sum(accidents$Severity==4))
#rf <- count_severity$freq/sum(count_severity$freq)
rf <- count_severity/sum(count_severity)
#barplot(rf, names.arg = count_severity$x, main = "Relative Frequencies of Severity")
barplot(rf, names.arg = 1:4, main = "Relative Frequencies of Severity")
This chunk of code creates a map with two variables from the data set, Start_Lng and Start_Lat. We see that more accidents occur closer to the east and west borders of the United States than in the middle.
map("usa", fill = TRUE, col = "white", bg = "lightblue")
points(accidents$Start_Lng, accidents$Start_Lat, pch = 20, cex = 0.01, col = "red")
### make histograms of latitude and longitude count_lat <- count(accidents\(Start_Lat) rf_lat <- count_lat\)freq/sum(count_lat\(freq) count_long <- count(accidents\)Start_Lng) rf_long <- count_long\(freq/sum(count_long\)freq) barplot(rf_lat, names.arg = count_lat\(x, main = "Table of latitude") barplot(rf_long, names.arg = count_long\)x, main = “Table of longitude”)
This code chunk calculates the time it took to clear the roadway of the accidents in hours, with the variables Start_Time and End_Time and displays a histogram of the times. We can see from histogram that the general trend of times it took to clear the accidents is right skewed with an unexpected peak around 6 hours and an outlier of 34.24 hours in North Bend, Oregon. This outlier can be seen in the accompanying data frame. It does not throw off the general trend of the distribution. It is a collective outlier.
interval <- interval(strptime(accidents$Start_Time, "%Y-%m-%d %H:%M:%S"), strptime(accidents$End_Time, "%Y-%m-%d %H:%M:%S"))
accidents$time_in_hours <- time_length(interval, unit = "hour")
#accidents$time_in_hours <- as.factor(accidents$time_in_hours)
#summary(accidents$time_in_hours)
hist(accidents$time_in_hours, breaks=seq(0,35,0.5), xlab = "Time in Hours", main = "Histogram of Time")
time_in_hours_outliears <- accidents[ which(accidents$time_in_hours > 10), ]
time_in_hours_outliears[c(5,51,17,19)]
## Severity time_in_hours City State
## 22 2 34.23639 North Bend OR
## 1410 2 11.58139 Point Mugu Nawc CA
## 2209 4 19.05972 Blue River OR
This code chunk makes a histogram of the Distance.mi variable, which is the reported distance in miles, of road affected by the accident. We see from the histogram that the distribution of distances is right skewed with most reported distances being 0 miles. This is probably misleading though since most car accidents do not even affect more than a mile of the road, but only a few feet. From a glance, there may seem to be no values after 5 miles on this Histogram, but upon further analysis, it is seen that there are some reported accidents that took up 20-30 miles of roadway. These values can be seen in the accompanying data frame. There is one collective outlier of 62.360 miles. This accident occurred in Spring Creek, Nevada. We also see in the scatterplot below the histogram that this accidents has a severity factor of 2, opposite of what would be expected. It is possible however, that an accident affecting this amount of road space would be highly regulated by police that it would be reported with a lower amount of severity in terms of traffic interference.
#accidents$Distance.mi. <- as.factor(accidents$Distance.mi.)
hist(accidents$Distance.mi., breaks=seq(0,63,1), xlab = "Distance (miles)", main="Histogram of Distance")
distance_outliers <- accidents[ which(accidents$Distance.mi. > 10), ]
distance_outliers[c(5,12,17,19)]
## Severity Distance.mi. City State
## 561 3 19.710 Paragonah UT
## 987 2 23.880 Syracuse NY
## 1225 3 23.530 Weedsport NY
## 1614 3 28.410 Westfield NY
## 1767 3 11.720 Mayer AZ
## 1828 4 27.650 Fort Bridger WY
## 2072 2 11.545 Morton IL
## 2489 3 29.370 Wilkes Barre PA
## 2938 3 10.290 Tuscaloosa AL
## 2959 3 29.780 Lehighton PA
## 3198 3 22.590 Hillburn NY
## 3304 3 15.540 Saugerties NY
## 3504 3 23.760 Kingston NY
## 3922 4 20.605 Ganado AZ
## 4565 2 62.360 Spring Creek NV
plot(accidents$Severity, accidents$Distance.mi., xlab = "Severity", ylab = "Distance (miles)")
#accidents\(Number <- as.factor(accidents\)Number) hist(accidents$Number, breaks=seq(0,100000, 1000))
This code chunk create a frequency table of the variable Side, which indicates which side of the road in which the accident was located. We see from the table that most the accidents were located on the right side of the road as opposed from the left, which make sense since people drive on the right side of the road on one way streets and highways and steer to the left to avoid coming into contact with stopped vehicles on the right side of the road.
count_side <- count(accidents$Side)
rf_side <- count_side$freq/sum(count_side$freq)
barplot(rf_side, names.arg = count_side$x, main = "Relative Frequency of Side")
plot(accidents\(Severity, accidents\)Side)
count_state <- count(accidents\(State) rf_state <- count_state\)freq/sum(count_state\(freq) barplot(rf_state, names.arg = count_state\)x, main = “Frequency Table of State”, cex.names = 0.28, xlab = “State”, ylab = “Frequency”)
accidents\(Simplezip <- substring(accidents\)Zipcode, 1, 5) hist(accidents$Simplezip)
This code chunk makes a frequency table of the variable Timezone. It helps determine which areas of the Unites States have more of the reported accidents in this data set. We see that most of the accidents occurred in the eastern region. The western and central region are about the same, but the western region has slightly more accidents than the central region. The mountain region does not have a lot of accidents, which makes sense since not many people live there. The unlabeled flat bar is the data that was not available.
count_timezone <- count(accidents$Timezone)
rf_timezone <- count_timezone$freq/sum(count_timezone$freq)
barplot(rf_timezone, names.arg = count_timezone$x, main = "Relative Frequency of timezone")
This code chunk creates a histogram of the temperatures in Fahrenheit for each accident. The histogram shows the distribution of the temperatures are roughly skewed to the left with plenty of variation. At 0 degrees Fahrenheit, there are few accidents probably because these temperatures are not common in the United States. As the temperature starts to increase, the number of accidents goes up since more people are out when it is warmer. When the temperature continues to increase from there, the number of accidents starts to go back down since fewer people are out if it is too hot.
#accidents$Temperature.F. <- as.factor(accidents$Temperature.F.)
hist(accidents$Temperature.F., breaks=seq(-40, 120, 5), xlab = "Temperature (Fahrenheit)", main = "Histogram of Temperature")
accidents\(Temperature.F. <- as.factor(accidents\)Temperature.F.) hist(accidents\(Temperature.F.[accidents\)State==“NJ”], breaks=seq(-40, 120, 5), xlab = “Temperature.F”)
This code chunk creates a histogram of the variable Wind_Chill.F. The distribution is similar to that of the Temperature.F variable, but it takes a dive around 50 degrees and comes back up, causing the distribution to be bimodal. It is a fact that wind chill temperatures are only defined at temperatures below 50 degrees. We can see in the scatterplot below the histogram that the wind chill temperatures are only colder than the normal temperatures below 50 degrees, which is expected. It is hard to see why the distribution of accidents given wind chill is bimodal. The histogram seems to indicate that less accidents occur at wind chill temperatures around 50, but this is probably because this particular wind chill temperature is not very common.
#accidents$Wind_Chill.F. <- as.factor(accidents$Wind_Chill.F.)
hist(accidents$Wind_Chill.F., breaks=seq(-50, 120, 5), xlab = "Wind Chill (Fahrenheit)", main = "Histogram of Wind Chill")
plot(accidents$Temperature.F., accidents$Wind_Chill.F., xlab = "Temperature (Fahrenheit)")
abline(0, 1)
### analyze humidity
This code chunk creates a histogram for the Humidity variable. We see from the histogram that the number of accidents appears to increase as the humidity goes up until the humidity level reaches 50 in which the number of accidents are generally stable. There is a dive in the graph around humidity levels of 90 to 100. It makes sense for there to be less accidents in lower humidity levels because less people would not be out as often if it is to dry or these humidity levels are rare. It might be believable to have some variability in humidity levels around 100 because too much air moisture could cause some people to not want to go out, but other people would not mind. The accompanying scatter plot shows plenty of variability in humidity and temperature, so there is no apparent relationship between these two variables.
#accidents$Humidity... <- as.factor(accidents$Humidity...)
hist(accidents$Humidity..., breaks=seq(0,105,3), xlab = "Humidity", main = "Histogram of Humidity")
plot(accidents$Temperature.F., accidents$Humidity..., cex = 0.01, xlab = "Temperature (Fahrenheit)", ylab = "Humidity")
This code chunk creates a Histogram of the Pressure.in. variable. We see from the histogram that the distribution of accidents is approximately normal with a mean around 30 inches of air pressure. There are some collective outliers between air pressure of 24 to 26 inches.
#accidents$Pressure.in. <- as.factor(accidents$Pressure.in.)
hist(accidents$Pressure.in., breaks=seq(23,32,0.3), xlab = "Pressure (in.)", main = "Histogram of Pressure")
This code chunk creates a histogram of the Visibility.mi. variable. As expected more accidents occur in when visibility is low. However, the histogram shows much more accidents occurring when the visibility is around 10 miles as opposed to accidents in which the visibility is less than 9 miles. This could be explained by people being unwilling to drive if visibility is to low and 10 miles could be the one visibility that most people think is okay to drive in and end up underestimating the difficulty of the drive. Accidents in visibilities greater than 20 miles are listed in the accompanying data frame.
#accidents$Visibility.mi. <- as.factor(accidents$Visibility.mi.)
hist(accidents$Visibility.mi., breaks=seq(0,55,2), xlab = "Visibility (miles)", main = "Histogram of Visibility")
visibility_outliers <- accidents[ which(accidents$Visibility.mi. > 20), ]
visibility_outliers[c(5,29,17,19)]
## Severity Visibility.mi. City State
## 283 2 30 Mesa AZ
## 353 2 25 Broomfield CO
## 533 3 40 Mesa AZ
## 767 2 40 Mesa AZ
## 946 2 30 Denver CO
## 1254 2 30 Denver CO
## 1256 2 30 Denver CO
## 1821 3 30 Denver CO
## 2224 3 30 Denver CO
## 2540 2 40 Mesa AZ
## 2620 3 30 Denver CO
## 2639 2 50 Westminster CO
## 2834 2 50 Denver CO
## 2921 2 40 Denver CO
## 3598 3 30 Morrison CO
## 3849 3 40 Denver CO
## 4103 3 50 Denver CO
count_wd <- count(accidents\(Wind_Direction) rf_wd <- count_wd\)freq/sum(count_wd\(freq) barplot(rf_wd, names.arg = count_wd\)x, main = “Frequency Table of wind direction”)
This code chunk creates a histogram for the Wind_Speed.mph. variable. The histogram shows that distribution of accidents by wind speed is skewed to the right. Opposite of what we would expect, more accidents appear to generally occur with lower wind speeds. It is hard to explain how general wind speed could affect accidents as the direction of the wind relative to the car would need to considered. It is plausible to say that less people would be driving if the wind were too strong, which would yield less accidents. It is also plausible to say that wind speeds too fast are not common and there would not be many accidents given these wind speeds.
#accidents$Wind_Speed.mph. <- as.factor(accidents$Wind_Speed.mph.)
hist(accidents$Wind_Speed.mph., breaks=seq(0,40,3), xlab = "Wind Speed (mph)", main = "Histogram of Wind Speed")
This code chunk creates a histogram of the Precipitation.in. variable. The histogram shows that the distribution of accidents by precipitation is right skewed. Opposite of what we would expect, much of accidents in this data set have almost zero inches of precipitation. (This may be misleading since some of the observations in the Precipitation.in. variable are not available.). Other than this, there are few accidents that occur with small amounts of precipitation and an outlier with 9.99 inches of precipitation in Brooklyn, New York, shown in the accompanying data frame. This could be explained with less people driving if there is too much precipitation or very high amounts of precipitation being very rare.
#accidents$Precipitation.in. <- as.factor(accidents$Precipitation.in.)
hist(accidents$Precipitation.in., breaks=seq(0,10,0.1), xlab = "Precipitation (in.)", main = "Histogram of Precipitation")
precipitation_outliers <- accidents[ which(accidents$Precipitation.in. > 1), ]
precipitation_outliers[c(5,32,17,19)]
## Severity Precipitation.in. City State
## 1676 3 9.99 Brooklyn NY
count_weather_condition <- count(accidents\(Weather_Condition) rf_count_weather_condition <- count_weather_condition\)freq/sum(count_weather_condition\(freq) barplot(rf_count_weather_condition, names.arg = count_weather_condition\)x, main = “Frequency Table of weather condition”)
The following bar plots are all true/false variables. They mostly indicate the presence of a particular road characteristic that could possibly interfere with the amount of accidents that take place. As shown, most of the bar plots have more false characteristics than true, so the future models involve a column that counts the true characteristics for each accident in the data set.
count_amenity <- count(accidents$Amenity)
rf_amenity <- count_amenity$freq/sum(count_amenity$freq)
barplot(rf_amenity, names.arg = count_amenity$x, main = "Table of amenity")
count_bump <- count(accidents$Bump)
rf_bump <- count_bump$freq/sum(count_bump$freq)
barplot(rf_bump, names.arg = count_bump$x, main = "Table of bump")
count_crossing <- count(accidents$Crossing)
rf_crossing <- count_crossing$freq/sum(count_crossing$freq)
barplot(rf_crossing, names.arg = count_crossing$x, main = "Table of crossing")
count_give_way <- count(accidents$Give_Way)
rf_give_way <- count_give_way$freq/sum(count_give_way$freq)
barplot(rf_give_way, names.arg = count_give_way$x, main = "Table of give way")
count_junction <- count(accidents$Junction)
rf_junction <- count_junction$freq/sum(count_junction$freq)
barplot(rf_junction, names.arg = count_junction$x, main = "Table of junction")
count_no_exit <- count(accidents$No_Exit)
rf_no_exit <- count_no_exit$freq/sum(count_no_exit$freq)
barplot(rf_no_exit, names.arg = count_no_exit$x, main = "Table of no exit")
count_railway <- count(accidents$Railway)
rf_railway <- count_railway$freq/sum(count_railway$freq)
barplot(rf_railway, names.arg = count_railway$x, main = "Table of railway")
count_roundabout <- count(accidents$Roundabout)
rf_roundabout <- count_roundabout$freq/sum(count_roundabout$freq)
barplot(rf_roundabout, names.arg = count_roundabout$x, main = "Table of roundabout")
count_station <- count(accidents$Station)
rf_station <- count_station$freq/sum(count_station$freq)
barplot(rf_station, names.arg = count_station$x, main = "Table of station")
count_stop <- count(accidents$Stop)
rf_stop <- count_stop$freq/sum(count_stop$freq)
barplot(rf_stop, names.arg = count_stop$x, main = "Table of stop")
count_traffic_calming <- count (accidents$Traffic_Calming)
rf_traffic_calming <- count_traffic_calming$freq/sum(count_traffic_calming$freq)
barplot(rf_traffic_calming, names.arg = count_traffic_calming$x, main = "Table of traffic calming")
count_traffic_signal <- count(accidents$Traffic_Signal)
rf_traffic_signal <- count_traffic_signal$freq/sum(count_traffic_signal$freq)
barplot(rf_traffic_signal, names.arg = count_traffic_signal$x, main = "Table of traffic signal")
count_turning_loop <- count(accidents$Turning_Loop)
rf_turning_loop <- count_turning_loop$freq/sum(count_turning_loop$freq)
barplot(rf_turning_loop, names.arg = count_turning_loop$x, main = "Table of turning loop")
accidents\(Amenity = as.logical(accidents\)Amenity) accidents\(Bump = as.logical(accidents\)Bump) accidents\(Crossing = as.logical(accidents\)Crossing) accidents\(Give_Way = as.logical(accidents\)Give_Way) accidents\(Junction = as.logical(accidents\)Junction) accidents\(No_Exit = as.logical(accidents\)No_Exit) accidents\(Railway = as.logical(accidents\)Railway) accidents\(Roundabout = as.logical(accidents\)Roundabout) accidents\(Station = as.logical(accidents\)Station) accidents\(Stop = as.logical(accidents\)Stop) accidents\(Traffic_Calming= as.logical(accidents\)Traffic_Calming) accidents\(Traffic_Signal = as.logical(accidents\)Traffic_Signal) accidents\(Turning_Loop = as.logical(accidents\)Turning_Loop) temp <- accidents[, 34:46] temp = apply(temp, 1, sum) accidents\(num_objects <- temp accidents\)Amenity <- NULL accidents\(Bump <- NULL accidents\)Crossing <- NULL accidents\(Give_Way <- NULL accidents\)Junction <- NULL accidents\(No_Exit <- NULL accidents\)Railway <- NULL accidents\(Roundabout <- NULL accidents\)Station <- NULL accidents\(Stop <- NULL accidents\)Traffic_Calming <- NULL accidents\(Traffic_Signal <- NULL accidents\)Turning_Loop <- NULL
Object_Presence <- merge(accidents\(Amenity, accidents\)Bump, accidents\(Crossing, accidents\)Give_Way, accidents\(Junction, accidents\)No_Exit, accidents\(Railway, accidents\)Roundabout, accidents\(Station, accidents\)Stop, accidents\(Traffic_Calming, accidents\)Traffic_Signal, accidents\(Turning_Loop, by x = "True", y = "False") #count_object_presence <- count(Object_Presence) #rf_object_presence <- count_object_presence\)freq/sum(count_object_presence\(freq) #barplot(rf_object_presence, names.arg = count_object_presence\)x, main = “Relative Frequancy of Object Presence”)
This code chunk creates a frequency table for the sunrise_sunset variable. The table shows that more accidents occur during the day than overnight. This may seem contrary to what we would expect from a glance, but it is also known that less people are out on the road at night. This would explain way there are less accidents overnight.
count_sunrise_sunset <- count(accidents$Sunrise_Sunset)
rf_sunrise_sunset <- count_sunrise_sunset$freq/sum(count_sunrise_sunset$freq)
barplot(rf_sunrise_sunset, names.arg = count_sunrise_sunset$x, main = "Relative Frequency of sunrise_sunset")
count_civil_twilight <- count(accidents\(Civil_Twilight) rf_civil_twilight <- count_civil_twilight\)freq/sum(count_civil_twilight\(freq) barplot(rf_civil_twilight, names.arg = count_civil_twilight\)x, main = “Table of civil_twilight”)
count_nautical_twilight <- count(accidents\(Nautical_Twilight) rf_nautical_twilight <- count_nautical_twilight\)freq/sum(count_nautical_twilight\(freq) barplot(rf_nautical_twilight, names.arg = count_nautical_twilight\)x, main = “Table of nautical_twilight”)
count_astronomical_twilight <- count(accidents\(Astronomical_Twilight) rf_astronomical_twilight <- count_astronomical_twilight\)freq/sum(count_astronomical_twilight\(freq) barplot(rf_astronomical_twilight, names.arg = count_astronomical_twilight\)x, main = “Table of astronomical_twilight”)
plot(accidents\(Distance.mi., accidents\)Severity, xlab = “Distance”, ylab = “Severity”)
plot(accidents\(Side, accidents\)Severity, xlab = “Side”, ylab = “Severity”)
plot(accidents\(Temperature.F., accidents\)Severity, xlab = “Temperature”, ylab = “Severity”)
plot(accidents\(Wind_Chill.F., accidents\)Severity, xlab = “wind chill”, ylab = “severity”)
plot(accidents\(Humidity..., accidents\)Severity, xlab = “humidity”, ylab = “severity”)
plot(accidents\(Pressure.in., accidents\)Severity, xlab = “Pressure”, ylab = “Severity”)
plot(accidents\(Visibility.mi., accidents\)Severity, xlab = “Visibility”, ylab = “Severity”)
plot(accidents\(Wind_Speed.mph., accidents\)Severity, xlab = “wind speed”, ylab = “severity”)
plot(accidents\(Precipitation.in., accidents\)Severity, xlab = “Precipitation”, ylab = “severity”)
plot(accidents\(Weather_Condition, accidents\)Severity, xlab = “Precipitation”, ylab = “severity”)
The next chunks of focus on creating models of the data set. The purpose is to seek out which factors are associated with the more traffic interfering accidents.
This code chunk imports the full accidents data along with another data set that contains information about populations given specific zip codes. I originally used the latitude and longitude for determining how much interference the accident had on traffic, but later realized that these location variables have more to do with the population of that particular area and I decided to replace them with population and area.
accidents = read.csv("C:/Users/Max Billante/Documents/Machine Learning/US Accidents Dec19.csv")
zippoparea = read.csv("C:/Users/Max Billante/Documents/ZipPopArea.csv")
#accidents = read.csv("C:/Users/Max Billante/Documents/US_Accidents_Dec19_smallest.csv")
The data set with populations for zip codes had a couple na values. To fix this problem, rule 2 of na.approx is used to return values closest to the data extreme.
zippoparea <- data.frame(na.approx(zippoparea, rule = 2))
Some of the zip codes in the accidents data contain more than five digits (indicating a specific address). Because all zip codes in the population data have five digits, the nine digit zip codes in the accidents data are shortened to five digits so both datasets can merge.
accidents$Zipcode <- substr(accidents$Zipcode, 1, 5)
This code chunk merges the accidents data and the population data together using the variable zip code. The final dataset is stored in accidents so that later models do not need editing.
zippoparea$Zipcode <- zippoparea$zipcode
accidents_population <- merge(accidents, zippoparea, by="Zipcode")
accidents <- accidents_population
This chunk of code removes variables that are not significant in determining the severity of the accidents. ID is removed because it is different for each particular accident and does not provide meaningful information about the severity of the accident. Source indicates the company or group that reported that accident which is also not meaningful for finding the severity of the accident. TMC is the traffic message channel code of the accident if it has one. This is also not meaningful for predicting accident severity, so it was removed. Start_Lat and Start_Lng were used earlier for mapping the location of the accident, but they are being replace with population and area and are not needed to generate future models. End_Lat and End_Lng contained only null values and were not meaningful. Start_Time and End_Time are also not meaningful in determining accident severity, so they are also removed. Description basically summarizes the effect the accidents had on traffic patterns in the area. This was different for every accident and therefore would not be best to visualize the data, so it is removed. Number, Street, City, County, State, Country, zipcodes, and Timezone all had to do with the location of the accidents, which is only meaningful because of the population of that particular area. Using the zipcodes, two other columns with population and area data were added to the dataset and the other location variables are removed. Weather_Timestamp mentions the time in which the weather conditions were reported, which is not meaningful for determining the severity of the accident, so it is removed. Wind_direction tells the general direction the wind was blowing when the accident was reported. There is too much variation with this variable since the direction the car was going would also have to be considered in determining the severity of the accident, so this variable is removed. Weather_Condition tells what the general weather pattern was like when the accident was reported. This is also a categorical variable with a lot of variation and there are other columns that help describe the weather at the time of the accident, so this variable is not needed. Crossing is removed because there are other variables, such as junction and railway that refer to the presence of any kind of crossing. Turning_Loop is removed because none of the accidents in this data involve a turning loop. Civil_Twilight, Nautical_Twilight, and Astronomical_Twilight were three of the four ways in this data set, to describe the time of day during the accidents (i.e. day/night). Only one of these methods is needed for such purposes in this study, so one of these variables, Sunrise_Sunset, is kept and the others are removed.
accidents$ID <- NULL #unique for every accident
accidents$Source <- NULL #not meaningful in accident severity
accidents$TMC <- NULL #not meaningful in accident severity
accidents$Start_Lat <- NULL #using population and area instead
accidents$Start_Lng <- NULL #using population and area instead
accidents$End_Lat <- NULL #all null
accidents$End_Lng <- NULL #all null
accidents$Description <- NULL #unique for every accident
accidents$Number <- NULL #using population and area instead
accidents$Street <- NULL #using population and area instead
accidents$City <- NULL #using population and area instead
accidents$County <- NULL #using population and area instead
accidents$State <- NULL #using population and area instead
accidents$Zipcode <- NULL #using population and area instead
accidents$zipcode <- NULL #using population and area instead
accidents$Country <- NULL #Already know all accidents occured in USA
accidents$Airport_Code <- NULL #not meaningfull in accident severity
accidents$Weather_Timestamp <- NULL #not meaningfull in accident severity
accidents$Wind_Direction <- NULL #to much variation
accidents$Weather_Condition <- NULL #kept other columns for weather instead
accidents$Crossing <- NULL #kept junction and railway columns instead
accidents$Turning_Loop <- NULL #all false
accidents$Civil_Twilight <- NULL #kept sunrise/sunset instead
accidents$Nautical_Twilight <- NULL #kept sunrise/sunset instead
accidents$Astronomical_Twilight <- NULL #kept sunrise/sunset instead
accidents$Timezone <- NULL #using population and area instead
accidents$Start_Time <- NULL #not meaningfull for accidents severity
accidents$End_Time <- NULL #not meaningfull for accidents severity
accidents\(long_lat <- cbind(c(accidents\)Start_Lng), c(accidents\(Start_Lat)) accidents\)Start_Lng <- NULL accidents$Start_Lat <- NULL
This code chunk defines the null values of the unremoved variables in the accidents data.
accidents$Severity[is.na(accidents$Severity)] = "No label"
#accidents$Start_Time[is.na(accidents$Start_Time)] = "Not recorded"
#accidents$End_Time[is.na(accidents$End_Time)] = "Not recorded"
#accidents$Start_Lat[is.na(accidents$Start_Lat)] = "Not recorded"
#accidents$Start_Lng[is.na(accidents$Start_Lng)] = "Not recorded"
accidents$Distance.mi.[is.na(accidents$Distance.mi.)] = 0
accidents$Side[is.na(accidents$Side)] = "Neither"
#accidents$Zipcode[is.na(accidents$Zipcode)] = "None"
#accidents$Timezone[is.na(accidents$Timezone)] = "Unknown"
accidents$Temperature.F.[is.na(accidents$Temperature.F.)] = 0
accidents$Wind_Chill.F.[is.na(accidents$Wind_Chill.F.)] = 0
accidents$Humidity...[is.na(accidents$Humidity...)] = 0
accidents$Pressure.in.[is.na(accidents$Pressure.in.)] = 0
accidents$Visibility.mi.[is.na(accidents$Visibility.mi.)] = 0
accidents$Wind_Speed.mph.[is.na(accidents$Wind_Speed.mph.)] = 0
accidents$Precipitation.in.[is.na(accidents$Precipitation.in.)] = 0
#accidents$Amenity[is.na(accidents$Amenity)] = "False"
#accidents$Bump[is.na(accidents$Bump)] = "False"
#accidents$Give_Way[is.na(accidents$Give_Way)] = "False"
#accidents$Junction[is.na(accidents$Junction)] = "False"
#accidents$No_Exit[is.na(accidents$No_Exit)] = "False"
#accidents$Railway[is.na(accidents$Railway)] = "False"
#accidents$Roundabout[is.na(accidents$Roundabout)] = "False"
#accidents$Station[is.na(accidents$Station)] = "False"
#accidents$Stop[is.na(accidents$Stop)] = "False"
#accidents$Traffic_Calming[is.na(accidents$Traffic_Calming)] = "False"
#accidents$Traffic_Signal[is.na(accidents$Traffic_Signal)] = "False"
accidents$Sunrise_Sunset[is.na(accidents$Sunrise_Sunset)] = "Unknown"
This code chunk displays the time it took to clean up the roadway of the accidents and removes the unneeded variables created in the process.
int <- interval(strptime(accidents\(Start_Time, "%Y-%m-%d %H:%M:%S"), strptime(accidents\)End_Time, “%Y-%m-%d %H:%M:%S”)) accidents$time_in_hours <- time_length(int, unit = “hour”)
accidents <- separate(accidents, Start_Time, c(“Date”, “Start_Time”), " “) accidents <- separate(accidents, End_Time, c(”Date2“,”End_Time“),” “) accidents\(Date2 <- NULL accidents\)day_of_week <- weekdays(as.Date(accidents$Date)) accidents <- separate(accidents, Date, c(”Year“,”Month“,”Day“),”-")
accidents\(Start_Time <- NULL accidents\)End_Time <- NULL #accidents\(Year <- NULL #accidents\)Day <- NULL
This code chunk counts the number of true variables of the true/false variables and displays the number of true variables in a separate column. The true/false variables are then removed.
accidents$Amenity = as.logical(accidents$Amenity)
accidents$Bump = as.logical(accidents$Bump)
accidents$Give_Way = as.logical(accidents$Give_Way)
accidents$Junction = as.logical(accidents$Junction)
accidents$No_Exit = as.logical(accidents$No_Exit)
accidents$Railway = as.logical(accidents$Railway)
accidents$Roundabout = as.logical(accidents$Roundabout)
accidents$Station = as.logical(accidents$Station)
accidents$Stop = as.logical(accidents$Stop)
accidents$Traffic_Calming= as.logical(accidents$Traffic_Calming)
accidents$Traffic_Signal = as.logical(accidents$Traffic_Signal)
temp <- accidents[, 11:21]
temp = apply(temp, 1, sum)
accidents$num_objects <- temp
accidents$Amenity <- NULL
accidents$Bump <- NULL
accidents$Give_Way <- NULL
accidents$Junction <- NULL
accidents$No_Exit <- NULL
accidents$Railway <- NULL
accidents$Roundabout <- NULL
accidents$Station <- NULL
accidents$Stop <- NULL
accidents$Traffic_Calming <- NULL
accidents$Traffic_Signal <- NULL
#library(stringr) #library(naniar) #accidents\(Severity <- as.factor(accidents\)Severity) #accidents\(Month <- as.factor(accidents\)Month) #accidents\(Distance.mi. <- as.factor(accidents\)Distance.mi.) accidents\(Zipcode <- as.character(accidents\)Zipcode) accidents\(lengthzip <- (str_length(accidents\)Zipcode) - 5)/5 accidents\(Simplezip <- substring(accidents\)Zipcode, 1, 5) accidents\(Zipcode <- NULL #accidents\)Temperature.F. <- as.factor(accidents\(Temperature.F.) #accidents\)Wind_Chill.F. <- as.factor(accidents\(Wind_Chill.F.) #accidents\)Humidity… <- as.factor(accidents\(Humidity...) #accidents\)Pressure.in. <- as.factor(accidents\(Pressure.in.) #accidents\)Visibility.mi. <- as.factor(accidents\(Visibility.mi.) #accidents\)Wind_Speed.mph. <- as.factor(accidents\(Wind_Speed.mph.) #accidents\)Precipitation.in. <- as.factor(accidents\(Precipitation.in.) accidents\)time_in_hours <- as.numeric(accidents\(time) accidents\)lengthzip <- as.factor(accidents\(lengthzip) accidents\)Simplezip <- as.factor(accidents\(Simplezip) accidents\)Simplezip <- NULL accidents\(day_of_week <- as.factor(accidents\)day_of_week) #accidents <- replace_with_na(accidents, replace = list(lengthzip = -1)) #accidents\(lengthzip[is.na(accidents\)lengthzip)] = “No Zip Code”
accidents\(Simplezip <- NULL accidents\)Temperature.F. <- as.numeric(accidents\(Temperature.F.) accidents\)Wind_Chill.F. <- as.numeric(accidents\(Wind_Chill.F.) accidents\)Humidity… <- as.numeric(accidents\(Humidity...) accidents\)Pressure.in. <- as.numeric(accidents\(Pressure.in.) accidents\)Visibility.mi. <- as.numeric(accidents\(Visibility.mi.) accidents\)Wind_Speed.mph. <- as.numeric(accidents\(Wind_Speed.mph.) accidents\)Precipitation.in. <- as.numeric(accidents\(Precipitation.in.) accidents\)Distance.mi. <- as.numeric(accidents\(Distance.mi.) accidents\)time_in_hours <- NULL
This code chunk makes simple 5 digit zip codes with the zip codes presented in the data.
library(tidyverse) accidents\(Zipcode <- as.character(accidents\)Zipcode) accidents\(lengthzip <- (str_length(accidents\)Zipcode) - 5)/5 accidents\(Simplezip <- substring(accidents\)Zipcode, 1, 5) accidents$Zipcode <- NULL
library(usmap) plot_usmap(data = accidents, labels=TRUE, values = “Simplezip”, label_color = “red”)
This next chunk creates a data partition to be used for a random forest model. This model will help determine which variables have most impact on the severity factor assigned to the accident. The data is shortened to 50000 observations with random sampling so the code will be able to run.
accidents$Severity <- as.factor(accidents$Severity)
accidents2 <- sample(1:nrow(accidents), 50000, replace = FALSE)
accidents2 <- accidents[accidents2, ]
set.seed(12345)
trainingIndices <- createDataPartition(accidents2$Severity, p = 0.7, list = FALSE)
training <- accidents2[trainingIndices, ]
testing <- accidents2[-trainingIndices, ]
Linear_Model <- train(Severity~., data = accidents, method = ‘lm’, preProcess=c(“center”,“scale”) )
predictedORD <- predict(ORD, newdata = testing) predict(ORD, predictedORD, type=“prob”)
CART <- train(Severity~., data = training,
method="rpart",
trControl=trainControl(method="CV", 10),
preProcess=c("center", "scale"))
This code chunk runs the random forest model with the training data partitioned in line 690. The following code chunk returns the accuracy and kappa of the model and a data frame with indicates how important each of the variables are in predicting Severity. We see that the accuracy of the model is 71.71% and the kappa is 31.25%. So this itself is not a very good model, but it mentions which variables are most important in predicting severity which are area, population, Pressure.in, Humidity…, Temperature.F., Distance.mi., and Wind_Speed.mph.
RF <- train(Severity~., data = training,
method="rf",
trControl=trainControl(method="CV", 10),
preProcess=c("center", "scale"))
predicted <- predict(RF, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 2 0 0
## 2 3 8897 2536 313
## 3 4 1167 1838 70
## 4 0 49 40 79
##
## Overall Statistics
##
## Accuracy : 0.721
## 95% CI : (0.7138, 0.7282)
## No Information Rate : 0.6744
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3211
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.0000000 0.8796 0.4164 0.170996
## Specificity 0.9998666 0.4159 0.8827 0.993877
## Pos Pred Value 0.0000000 0.7573 0.5969 0.470238
## Neg Pred Value 0.9995332 0.6251 0.7839 0.974174
## Prevalence 0.0004667 0.6744 0.2943 0.030804
## Detection Rate 0.0000000 0.5932 0.1225 0.005267
## Detection Prevalence 0.0001334 0.7834 0.2053 0.011201
## Balanced Accuracy 0.4999333 0.6478 0.6496 0.582436
varImp(RF$finalModel)
## Overall
## Distance.mi. 1375.1540
## SideL 415.7314
## SideR 419.6110
## Temperature.F. 1645.4152
## Wind_Chill.F. 738.2346
## Humidity... 1703.6717
## Pressure.in. 1882.6875
## Visibility.mi. 443.6721
## Wind_Speed.mph. 1190.2029
## Precipitation.in. 188.6516
## Sunrise_SunsetDay 132.8786
## Sunrise_SunsetNight 136.2855
## people 2524.1789
## area 2730.8120
## num_objects 472.1366
plot(accidents$people, accidents$area)
cor(accidents$people, accidents$area)
## [1] -0.1010014
This code chunk creates another data partition to run an ordinal regression model.
accidents$Severity <- as.factor(accidents$Severity)
set.seed(12345)
trainingIndices <- createDataPartition(accidents$Severity, p = 0.7, list = FALSE)
training <- accidents[trainingIndices, ]
testing <- accidents[-trainingIndices, ]
This code chunk runs an ordinal regression model with the training data partitioned in line 753. The following chunk returns the accuracy and kappa of the model. It is seen that the accuracy is 67.51% and the kappa is 2.21%. So this is not a very good model.
memory.size(max = T)
## [1] 6436.81
library(MASS)
ORD <- polr(Severity~people+area+Pressure.in.+Humidity...+Temperature.F.+Wind_Speed.mph.,
data = training
)
ORD
## Call:
## polr(formula = Severity ~ people + area + Pressure.in. + Humidity... +
## Temperature.F. + Wind_Speed.mph., data = training)
##
## Coefficients:
## people area Pressure.in. Humidity... Temperature.F.
## -3.266532e-06 -2.762921e-04 -5.566311e-03 1.322929e-03 -2.166833e-03
## Wind_Speed.mph.
## 6.489672e-03
##
## Intercepts:
## 1|2 2|3 3|4
## -8.3351671 0.4404243 3.1761940
##
## Residual Deviance: 2930590.68
## AIC: 2930608.68
predicted <- predict(ORD, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 0 0 0
## 2 273 574818 253000 26143
## 3 0 3 3 0
## 4 0 4 1 0
##
## Overall Statistics
##
## Accuracy : 0.6729
## 95% CI : (0.6719, 0.6739)
## No Information Rate : 0.6729
## P-Value [Acc > NIR] : 0.5042
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.0000000 1.000e+00 1.186e-05 0.000e+00
## Specificity 1.0000000 1.432e-05 1.000e+00 1.000e+00
## Pos Pred Value NaN 6.729e-01 5.000e-01 0.000e+00
## Neg Pred Value 0.9996804 3.636e-01 7.038e-01 9.694e-01
## Prevalence 0.0003196 6.729e-01 2.962e-01 3.060e-02
## Detection Rate 0.0000000 6.729e-01 3.512e-06 0.000e+00
## Detection Prevalence 0.0000000 1.000e+00 7.024e-06 5.853e-06
## Balanced Accuracy 0.5000000 5.000e-01 5.000e-01 5.000e-01
This code chunk create a data partition for a Gradient Boosting Machine model. The data is shortened to 10000 observations so it can run.
accidents$Severity <- as.factor(accidents$Severity)
accidents2 <- sample(1:nrow(accidents), 10000, replace = FALSE)
accidents2 <- accidents[accidents2, ]
set.seed(12345)
trainingIndices <- createDataPartition(accidents2$Severity, p = 0.7, list = FALSE)
training <- accidents2[trainingIndices, ]
testing <- accidents2[-trainingIndices, ]
This code chunk run the Gradient Boosting Machine model with the training data partitioned in line 785 and returns the accuracy, kappa, a data frame indicating how important each variable is in determining Severity, and a relative influence table for each variable. It is seen that the accuracy is 69.29% and the kappa is 14.25%, so this is not the best model.
GBM <- train(Severity~., data=training,
method="gbm",
trControl=trainControl(method="CV", 10),
preProcess=c("center", "scale"))
predicted <- predict(GBM, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 0 0 0
## 2 1 1845 696 62
## 3 0 165 197 12
## 4 0 6 4 10
##
## Overall Statistics
##
## Accuracy : 0.6845
## 95% CI : (0.6675, 0.7011)
## No Information Rate : 0.6724
## P-Value [Acc > NIR] : 0.08325
##
## Kappa : 0.1661
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.0000000 0.9152 0.21962 0.119048
## Specificity 1.0000000 0.2271 0.91575 0.996568
## Pos Pred Value NaN 0.7085 0.52674 0.500000
## Neg Pred Value 0.9996664 0.5660 0.73323 0.975151
## Prevalence 0.0003336 0.6724 0.29920 0.028019
## Detection Rate 0.0000000 0.6154 0.06571 0.003336
## Detection Prevalence 0.0000000 0.8686 0.12475 0.006671
## Balanced Accuracy 0.5000000 0.5711 0.56769 0.557808
summary(GBM)
## var rel.inf
## Distance.mi. Distance.mi. 24.1742792
## area area 11.7041802
## SideL SideL 11.6762127
## people people 9.7188676
## SideR SideR 8.6244831
## num_objects num_objects 7.4940384
## Pressure.in. Pressure.in. 5.2466788
## Wind_Chill.F. Wind_Chill.F. 4.5774834
## Temperature.F. Temperature.F. 4.3186677
## Humidity... Humidity... 3.6173549
## Wind_Speed.mph. Wind_Speed.mph. 3.4102515
## Precipitation.in. Precipitation.in. 2.3084135
## Visibility.mi. Visibility.mi. 1.3697412
## Sunrise_SunsetDay Sunrise_SunsetDay 1.2047659
## Sunrise_SunsetNight Sunrise_SunsetNight 0.5545819
This code chunk creates another data partition to run an ordinal regression model.
accidents$Severity <- as.factor(accidents$Severity)
set.seed(12345)
trainingIndices <- createDataPartition(accidents$Severity, p = 0.7, list = FALSE)
training <- accidents[trainingIndices, ]
testing <- accidents[-trainingIndices, ]
This code chunk runs an ordinal regression model with the training data partitioned in line 827. The following chunk returns the accuracy and kappa of the model. It is seen that the accuracy is 67.59% and the kappa is 3.12%. So this is better than the model with Random Forest, but still not very good.
memory.size(max = T)
## [1] 6631.44
library(MASS)
ORD <- polr(Severity~Distance.mi.+area+people+Side+num_objects+Pressure.in.+Wind_Chill.F.,
data = training
)
ORD
## Call:
## polr(formula = Severity ~ Distance.mi. + area + people + Side +
## num_objects + Pressure.in. + Wind_Chill.F., data = training)
##
## Coefficients:
## Distance.mi. area people SideL SideR
## 1.843927e-01 -6.126932e-04 -3.173490e-06 1.550341e-01 1.769782e+00
## num_objects Pressure.in. Wind_Chill.F.
## -7.260030e-01 -4.056157e-03 -5.016252e-03
##
## Intercepts:
## 1|2 2|3 3|4
## -7.517480 1.803786 4.685571
##
## Residual Deviance: 2729297.25
## AIC: 2729319.25
predicted <- predict(ORD, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 0 0 0
## 2 273 570244 245233 22512
## 3 0 4430 6985 3450
## 4 0 151 786 181
##
## Overall Statistics
##
## Accuracy : 0.6759
## 95% CI : (0.6749, 0.6769)
## No Information Rate : 0.6729
## P-Value [Acc > NIR] : 1.223e-09
##
## Kappa : 0.0312
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.0000000 0.99203 0.027608 0.0069235
## Specificity 1.0000000 0.04081 0.986894 0.9988685
## Pos Pred Value NaN 0.68027 0.469896 0.1618962
## Neg Pred Value 0.9996804 0.71338 0.706904 0.9695684
## Prevalence 0.0003196 0.67290 0.296173 0.0306036
## Detection Rate 0.0000000 0.66754 0.008177 0.0002119
## Detection Prevalence 0.0000000 0.98129 0.017401 0.0013088
## Balanced Accuracy 0.5000000 0.51642 0.507251 0.5028960
Because the Random Forest model did best of the models run, we will be trying it on the whole data set. However, the data is too big to run in a model all at once, so it is split into smaller components for comparison.
This code chunk remove the variables that are not importance according to the results of the Random Forest model run in line 742.
accidents\(Side <- NULL accidents\)Wind_Chill.F. <- NULL accidents\(Visibility.mi. <- NULL accidents\)Precipitation.in. <- NULL accidents\(Sunrise_Sunset <- NULL accidents\)num_objects <- NULL
set.seed(12345) s_accidents <- accidents subsets_s_accidents <- split(s_accidents, sample(rep(1:10, 50000))) s_accidents_1 <- subsets_s_accidents[[1]] s_accidents_2 <- subsets_s_accidents[[2]] s_accidents_3 <- subsets_s_accidents[[3]] s_accidents_4 <- subsets_s_accidents[[4]] s_accidents_5 <- subsets_s_accidents[[5]] s_accidents_6 <- subsets_s_accidents[[6]] s_accidents_7 <- subsets_s_accidents[[7]] s_accidents_8 <- subsets_s_accidents[[8]] s_accidents_9 <- subsets_s_accidents[[9]] s_accidents_10 <- subsets_s_accidents[[10]]
set.seed(12345) trainingIndices <- createDataPartition(s_accidents_1$Severity, p = 0.7, list = FALSE) training <- s_accidents_1[trainingIndices, ] testing <- s_accidents_1[-trainingIndices, ]
gc() RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
set.seed(12345) trainingIndices <- createDataPartition(s_accidents_2$Severity, p = 0.7, list = FALSE) training <- s_accidents_2[trainingIndices, ] testing <- s_accidents_2[-trainingIndices, ]
gc() RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_3$Severity, p = 0.7, list = FALSE) training <- s_accidents_3[trainingIndices, ] testing <- s_accidents_3[-trainingIndices, ]
gc() RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_4$Severity, p = 0.7, list = FALSE) training <- s_accidents_4[trainingIndices, ] testing <- s_accidents_4[-trainingIndices, ]
gc() RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_5$Severity, p = 0.7, list = FALSE) training <- s_accidents_5[trainingIndices, ] testing <- s_accidents_5[-trainingIndices, ]
gc() RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_6$Severity, p = 0.7, list = FALSE) training <- s_accidents_6[trainingIndices, ] testing <- s_accidents_6[-trainingIndices, ]
gc() RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_7$Severity, p = 0.7, list = FALSE) training <- s_accidents_7[trainingIndices, ] testing <- s_accidents_7[-trainingIndices, ]
RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_8$Severity, p = 0.7, list = FALSE) training <- s_accidents_8[trainingIndices, ] testing <- s_accidents_8[-trainingIndices, ]
RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_9$Severity, p = 0.7, list = FALSE) training <- s_accidents_9[trainingIndices, ] testing <- s_accidents_9[-trainingIndices, ]
RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
gc() set.seed(12345) trainingIndices <- createDataPartition(s_accidents_10$Severity, p = 0.7, list = FALSE) training <- s_accidents_10[trainingIndices, ] testing <- s_accidents_10[-trainingIndices, ]
RF <- train(Severity~., data = training, method=“rf”, trControl=trainControl(method=“CV”, 10), preProcess=c(“center”, “scale”))
predicted <- predict(RF, newdata = testing) confusionMatrix(predicted, testing$Severity)
To conclude this project, more variables would need to be taken into consideration to produce better models for predicting which factor really contribute to car accidents that interfere with the flow of traffic. As a follow up study, We could try to analyze other factors that cause accidents to interfere with traffic and run a Random Forest model to see which ones are most important and create an ordinal regression model with them. It would also be good to look for other accurate models as well.