The objective

This project uses the dataset "US Accidents (3.5 million records): A Countrywide Traffic Accident Dataset (2016 - 2020)" to analyze which factors of car accidents are most associated with interference in traffic. The dataset contains a wide range of information about each accident: 2,974,335 observations with 49 variables each. Some of the variables are categorical; they indicate whether a particular object or road condition was present at the time of the accident, where exactly the accident occurred, and the time of day. Other variables are quantitative; they measure how much of a weather-related factor, such as wind or precipitation, was present during the accident. This study uses the ordinal variable Severity as the response. Severity assigns a value from 1 to 4 describing how much impact the accident had on traffic, with 4 being the most interference and 1 the least. We use the other variables in this dataset to find out which factors of car accidents are most associated with the amount of resulting traffic interference.

Load packages

The R packages for this project are loaded here.

library(plyr)      #count() for frequency tables
library(maps)      #map() for plotting accident locations
library(lubridate) #interval() and time_length() for clearance times
library(caret)     #createDataPartition(), train(), confusionMatrix()
library(zoo)       #na.approx() for filling in NA values

Visualize the data

Before we begin building models, it is a good idea to visualize the data to get a sense of its shape. The next several code chunks focus on this.

Import a subset of the data

Because the accidents data set is so big, we import a subset of the data for visualization purposes. This subset is a random sample of 5,000 observations from the full data set.

accidents = read.csv("C:/Users/Max Billante/Documents/Machine Learning/us accidents small.csv")

Make a bar plot for severity

First, we make a bar plot of the relative frequencies of the variable Severity. From the plot we see that most of the accidents in this data set have a severity of 2. A severity of 3 is neither common nor uncommon. There are few accidents with severity 1 or 4, though 4 is more common than 1.

accidents$Severity <- as.integer(accidents$Severity)
count_severity <- c(sum(accidents$Severity==1), sum(accidents$Severity==2), sum(accidents$Severity==3), sum(accidents$Severity==4))
rf <- count_severity/sum(count_severity)
barplot(rf, names.arg = 1:4, main = "Relative Frequencies of Severity")
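
An equivalent, more compact way to compute the same relative frequencies uses table and prop.table:

rf <- prop.table(table(factor(accidents$Severity, levels = 1:4))) #counts per level, normalized
barplot(rf, main = "Relative Frequencies of Severity")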

Map latitude and longitude on USA

Next, we make a map using two variables from the data set, Start_Lng and Start_Lat. We see that accidents concentrate in urban areas across the eastern, western, and central sections of the USA.

map("usa", fill = TRUE, col = "white", bg = "lightblue")
points(accidents$Start_Lng, accidents$Start_Lat, pch = 20, cex  = 0.01, col = "red")

Display the accident clearance times

With the variables Start_Time and End_Time, we calculate how long it took to clear the road after each accident and display the result in a histogram. The distribution of clearance times is right skewed, with an unexpected peak around 6 hours and an extreme outlier of 34.24 hours in North Bend, Oregon, which can be seen in the accompanying data frame. Because it is a single isolated point, it does not throw off the general trend of the distribution.

clearance <- interval(strptime(accidents$Start_Time, "%Y-%m-%d %H:%M:%S"), strptime(accidents$End_Time, "%Y-%m-%d %H:%M:%S")) #avoid shadowing lubridate's interval()
accidents$time_in_hours <- time_length(clearance, unit = "hour")
hist(accidents$time_in_hours, breaks=seq(0,35,0.5), xlab = "Time in Hours", main = "Histogram of Time")

time_in_hours_outliers <- accidents[which(accidents$time_in_hours > 10), ]
time_in_hours_outliers[c(5, 51, 17, 19)] #Severity, time_in_hours, City, State
##      Severity time_in_hours            City State
## 22          2      34.23639      North Bend    OR
## 1410        2      11.58139 Point Mugu Nawc    CA
## 2209        4      19.05972      Blue River    OR

Analyze distances

We analyze the amount of road space affected by the accidents with a histogram of the Distance.mi. variable. The distribution of distances is right skewed, with most reported distances being 0 miles. This is probably an artifact of rounding, since most car accidents affect only a few feet of road rather than a full mile. At a glance there may seem to be no values past 5 miles, but on closer inspection some reported accidents took up 20 to 30 miles of roadway; these can be seen in the accompanying data frame. There is one extreme outlier of 62.360 miles, from an accident in Spring Creek, Nevada. The scatterplot below the histogram shows that this accident has a severity of 2, the opposite of what we would expect. It is possible, however, that an accident affecting this much road would be so heavily managed by police that it ends up being reported with a lower severity in terms of traffic interference.

temp2 <- accidents$Distance.mi.
temp2 = temp2[temp2 < 5]
hist(temp2, breaks=seq(0,5,0.5), xlab = "Distance (miles)", main="Histogram of Distance")

distance_outliers <- accidents[ which(accidents$Distance.mi. > 10), ]
distance_outliers[c(5,12,17,19)]
##      Severity Distance.mi.         City State
## 561         3       19.710    Paragonah    UT
## 987         2       23.880     Syracuse    NY
## 1225        3       23.530    Weedsport    NY
## 1614        3       28.410    Westfield    NY
## 1767        3       11.720        Mayer    AZ
## 1828        4       27.650 Fort Bridger    WY
## 2072        2       11.545       Morton    IL
## 2489        3       29.370 Wilkes Barre    PA
## 2938        3       10.290   Tuscaloosa    AL
## 2959        3       29.780    Lehighton    PA
## 3198        3       22.590     Hillburn    NY
## 3304        3       15.540   Saugerties    NY
## 3504        3       23.760     Kingston    NY
## 3922        4       20.605       Ganado    AZ
## 4565        2       62.360 Spring Creek    NV
plot(accidents$Severity, accidents$Distance.mi., xlab = "Severity", ylab = "Distance (miles)")

Analyze side

We analyze which side of the road the accidents were reported on with a bar plot of the variable Side. Most of the accidents were located on the right side of the road as opposed to the left. This makes sense, since people drive on the right side of the road on one-way streets and highways and steer to the left to avoid stopped vehicles on the right side of the road.

count_side <- count(accidents$Side)
rf_side <- count_side$freq/sum(count_side$freq)
barplot(rf_side, names.arg = count_side$x, main = "Relative Frequency of Side")

Make a bar plot for timezone

We analyze the variable Timezone with a bar plot. It helps determine which regions of the United States account for more of the reported accidents. Most of the accidents occurred in the eastern region. The western and central regions are about the same, with the western region slightly ahead. The mountain region has few accidents, which makes sense since relatively few people live there. The unlabeled flat bar corresponds to records where the time zone was not available.

count_timezone <- count(accidents$Timezone)
rf_timezone <- count_timezone$freq/sum(count_timezone$freq)
barplot(rf_timezone, names.arg = count_timezone$x, main = "Relative Frequency of timezone")

Analyze temperature

We analyze the reported temperature at the time of the accidents with a histogram. The distribution of temperatures is roughly left skewed. Near 0 degrees Fahrenheit there are few accidents, probably because such temperatures are uncommon in the United States. As the temperature increases, the number of accidents goes up, since more people are out when it is warmer. As the temperature continues to rise, the number of accidents falls again, since fewer people are out when it is very hot.

hist(accidents$Temperature.F., breaks=seq(-40, 120, 5), xlab = "Temperature (Fahrenheit)", main = "Histogram of Temperature")

Analyze wind_chill

We also use a histogram to analyze wind chill, the temperature it feels like when wind is present. The distribution is similar to that of Temperature.F., but it dips around 50 degrees and comes back up, making the distribution bimodal. This is explained by the fact that the standard wind chill formula is only defined for temperatures at or below 50 degrees Fahrenheit. The scatterplot below the histogram confirms that wind chill readings are only colder than the air temperature below 50 degrees, as expected. The histogram seems to indicate that fewer accidents occur at wind chill temperatures around 50, but this is probably because wind chill readings near 50 are not very common.

hist(accidents$Wind_Chill.F., breaks=seq(-50, 120, 5), xlab = "Wind Chill (Fahrenheit)", main = "Histogram of Wind Chill")

plot(accidents$Temperature.F., accidents$Wind_Chill.F., xlab = "Temperature (Fahrenheit)", ylab = "Wind Chill (Fahrenheit)")
abline(0, 1)
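
For reference, the National Weather Service wind chill formula, which is only defined for temperatures at or below 50 degrees Fahrenheit and wind speeds above 3 mph, can be written as a small function:

wind_chill <- function(temp_f, wind_mph) { #NWS formula, degrees Fahrenheit
  35.74 + 0.6215 * temp_f - 35.75 * wind_mph^0.16 + 0.4275 * temp_f * wind_mph^0.16
}
wind_chill(30, 20) #roughly 17 degrees Fahrenheit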

Analyze humidity

We use a histogram to analyze the Humidity... variable. The number of accidents appears to increase as humidity rises until the humidity level reaches about 50, after which the counts are generally stable. There is a dip around humidity levels of 90 to 100. It makes sense for there to be fewer accidents at low humidity levels, either because people go out less often when it is very dry or because such humidity levels are rare. Some variability near 100 is also believable, since very moist air might keep some people home but not others. The accompanying scatterplot shows plenty of spread in humidity at every temperature, so there is no apparent relationship between these two variables.

hist(accidents$Humidity..., breaks=seq(0,105,3), xlab = "Humidity", main = "Histogram of Humidity")

plot(accidents$Temperature.F., accidents$Humidity..., cex = 0.01, xlab = "Temperature (Fahrenheit)", ylab = "Humidity")

Analyze pressure

The Pressure.in. variable is the reported air pressure in inches at the time of the accidents. The histogram shows that the distribution of air pressure is approximately normal, centered around 30 inches. There is a small cluster of outliers between 24 and 26 inches.

hist(accidents$Pressure.in., breaks=seq(23,32,0.3), xlab = "Pressure (in.)", main = "Histogram of Pressure")

Analyze visibility

Here is a histogram of the Visibility.mi. variable. As expected, more accidents occur when visibility is low. However, the histogram shows far more accidents at a visibility of around 10 miles than at visibilities under 9 miles. One possible explanation is that people are unwilling to drive when visibility is too low, and 10 miles may be the visibility most people consider safe to drive in, leading them to underestimate the difficulty of the drive. Accidents with visibilities greater than 20 miles are listed in the accompanying data frame.

temp4 <- accidents$Visibility.mi.
temp4 = temp4[temp4 < 12]
hist(temp4, breaks=seq(0,12,2), xlab = "Visibility (miles)", main = "Histogram of Visibility")

visibility_outliers <- accidents[ which(accidents$Visibility.mi. > 20), ]
visibility_outliers[c(5,29,17,19)]
##      Severity Visibility.mi.        City State
## 283         2             30        Mesa    AZ
## 353         2             25  Broomfield    CO
## 533         3             40        Mesa    AZ
## 767         2             40        Mesa    AZ
## 946         2             30      Denver    CO
## 1254        2             30      Denver    CO
## 1256        2             30      Denver    CO
## 1821        3             30      Denver    CO
## 2224        3             30      Denver    CO
## 2540        2             40        Mesa    AZ
## 2620        3             30      Denver    CO
## 2639        2             50 Westminster    CO
## 2834        2             50      Denver    CO
## 2921        2             40      Denver    CO
## 3598        3             30    Morrison    CO
## 3849        3             40      Denver    CO
## 4103        3             50      Denver    CO

Analyze wind speed

For the Wind_Speed.mph. variable, the histogram shows that the distribution of accidents by wind speed is skewed to the right. Contrary to what we would expect, more accidents appear to occur at lower wind speeds. It is hard to explain how wind speed alone could affect accidents, since the direction of the wind relative to the car would also need to be considered. It is plausible that fewer people drive when the wind is too strong, which would yield fewer accidents, and also that very high wind speeds are simply uncommon.

hist(accidents$Wind_Speed.mph., breaks=seq(0,40,3), xlab = "Wind Speed (mph)", main = "Histogram of Wind Speed")

Analyze precipitation

For the Precipitation.in. variable, the histogram shows that the distribution of accidents by precipitation is right skewed. Contrary to what we would expect, most accidents in this data set have almost zero inches of precipitation. (This may be misleading, since some observations of Precipitation.in. are not available.) Beyond this, there are a few accidents with small amounts of precipitation and one outlier with 9.99 inches of precipitation in Brooklyn, New York, shown in the accompanying data frame. This could be explained by fewer people driving when there is heavy precipitation, or by very high amounts of precipitation simply being rare.

temp3 <- accidents$Precipitation.in.
temp3 = temp3[temp3 < 1]
hist(temp3, breaks=seq(0,1,0.1), xlab = "Precipitation (in.)", main = "Histogram of Precipitation")

precipitation_outliers <- accidents[ which(accidents$Precipitation.in. > 1), ]
precipitation_outliers[c(5,32,17,19)]
##      Severity Precipitation.in.     City State
## 1676        3              9.99 Brooklyn    NY

The following bar plots all show true/false variables. Each one indicates the presence of a particular road characteristic that could influence the accidents that take place. As shown, most of these variables are false far more often than true, so the later models use a single column that counts the true characteristics for each accident. The thirteen plots could also be produced with the loop sketched below.
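
A compact alternative to the thirteen chunks that follow, assuming the column names listed here:

object_vars <- c("Amenity", "Bump", "Crossing", "Give_Way", "Junction", "No_Exit", "Railway", "Roundabout", "Station", "Stop", "Traffic_Calming", "Traffic_Signal", "Turning_Loop")
for (v in object_vars) {
  counts <- count(accidents[[v]]) #plyr::count returns a data frame with x and freq
  barplot(counts$freq/sum(counts$freq), names.arg = counts$x, main = paste("Relative Frequency of", v))
}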

Make a bar plot for amenity

count_amenity <- count(accidents$Amenity)
rf_amenity <- count_amenity$freq/sum(count_amenity$freq)
barplot(rf_amenity, names.arg = count_amenity$x, main = "Relative Frequency of Amenity")

Make a bar plot for bump

count_bump <- count(accidents$Bump)
rf_bump <- count_bump$freq/sum(count_bump$freq)
barplot(rf_bump, names.arg = count_bump$x, main = "Relative Frequency of Bump")

Make a bar plot for crossing

count_crossing <- count(accidents$Crossing)
rf_crossing <- count_crossing$freq/sum(count_crossing$freq)
barplot(rf_crossing, names.arg = count_crossing$x, main = "Relative Frequency of Crossing")

Make a bar plot for give way

count_give_way <- count(accidents$Give_Way)
rf_give_way <- count_give_way$freq/sum(count_give_way$freq)
barplot(rf_give_way, names.arg = count_give_way$x, main = "Relative Frequency of Give Way")

Make a bar plot for junction

count_junction <- count(accidents$Junction)
rf_junction <- count_junction$freq/sum(count_junction$freq)
barplot(rf_junction, names.arg = count_junction$x, main = "Relative Frequency of Junction")

Make a bar plot for no_exit

count_no_exit <- count(accidents$No_Exit)
rf_no_exit <- count_no_exit$freq/sum(count_no_exit$freq)
barplot(rf_no_exit, names.arg = count_no_exit$x, main = "Relative Frequency of No Exit")

Make a bar plot for railway

count_railway <- count(accidents$Railway)
rf_railway <- count_railway$freq/sum(count_railway$freq)
barplot(rf_railway, names.arg = count_railway$x, main = "Relative Frequency of Railway")

Make a bar plot for roundabout

count_roundabout <- count(accidents$Roundabout)
rf_roundabout <- count_roundabout$freq/sum(count_roundabout$freq)
barplot(rf_roundabout, names.arg = count_roundabout$x, main = "Relative Frequency of Roundabout")

Make a bar plot for station

count_station <- count(accidents$Station)
rf_station <- count_station$freq/sum(count_station$freq)
barplot(rf_station, names.arg = count_station$x, main = "Relative Frequency of Station")

Make a bar plot for stop

count_stop <- count(accidents$Stop)
rf_stop <- count_stop$freq/sum(count_stop$freq)
barplot(rf_stop, names.arg = count_stop$x, main = "Relative Frequency of Stop")

Make a bar plot for traffic_calming

count_traffic_calming <- count(accidents$Traffic_Calming)
rf_traffic_calming <- count_traffic_calming$freq/sum(count_traffic_calming$freq)
barplot(rf_traffic_calming, names.arg = count_traffic_calming$x, main = "Relative Frequency of Traffic Calming")

Make a bar plot for traffic signal

count_traffic_signal <- count(accidents$Traffic_Signal)
rf_traffic_signal <- count_traffic_signal$freq/sum(count_traffic_signal$freq)
barplot(rf_traffic_signal, names.arg = count_traffic_signal$x, main = "Relative Frequency of Traffic Signal")

Make a bar plot for turning loop

count_turning_loop <- count(accidents$Turning_Loop)
rf_turning_loop <- count_turning_loop$freq/sum(count_turning_loop$freq)
barplot(rf_turning_loop, names.arg = count_turning_loop$x, main = "Relative Frequency of Turning Loop")

Analyze sunrise_sunset

This bar plot is for the Sunrise_Sunset variable. It shows that more accidents occur during the day than overnight. This may seem contrary to what we would expect at first glance, but far fewer people are on the road at night, which would explain why there are fewer accidents overnight.

count_sunrise_sunset <- count(accidents$Sunrise_Sunset)
rf_sunrise_sunset <- count_sunrise_sunset$freq/sum(count_sunrise_sunset$freq)
barplot(rf_sunrise_sunset, names.arg = count_sunrise_sunset$x, main = "Relative Frequency of sunrise_sunset")

Create models

Now that we have visualized the data, we can begin building models to find out which variables in this data set are most associated with car accidents that have a greater effect on traffic. The following code chunks focus on this.

Import full accidents data and zippoparea data

We now import the full accidents data along with another data set that contains the population and area of each zip code. We originally used latitude and longitude to help determine how much interference an accident had on traffic, but later realized that these location variables mostly stand in for the population of the surrounding area, so they are replaced with population and area.

accidents = read.csv("C:/Users/Max Billante/Documents/Machine Learning/US Accidents Dec19.csv")
zippoparea = read.csv("C:/Users/Max Billante/Documents/ZipPopArea.csv")

Fill in NA values of zippoparea data

The data set with populations for zip codes had a few NA values. To fix this, na.approx is used with rule = 2, which linearly interpolates interior NAs and fills leading and trailing NAs with the value at the nearest data extreme.

zippoparea <- data.frame(na.approx(zippoparea, rule = 2))
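
A tiny worked example of what rule = 2 does:

na.approx(c(NA, 2, NA, 6, NA), rule = 2) #returns 2 2 4 6 6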

Substring zipcodes so that they are all five digits

Some of the zip codes in the accidents data contain more than five digits (a ZIP+4 code indicating a more specific address). Because all zip codes in the population data have five digits, the nine-digit zip codes in the accidents data are shortened to five digits so the two datasets can be merged.

accidents$Zipcode <- substr(accidents$Zipcode, 1, 5) #e.g. "49503-1234" becomes "49503"

Merge accidents data and zippoparea data

Now we merge the accidents data and the population data on the zip code variable. Note that merge performs an inner join by default, so accidents whose zip code does not appear in the population data are dropped. The merged data is stored back in accidents so that the later models do not need editing.

zippoparea$Zipcode <- zippoparea$zipcode
accidents_population <- merge(accidents, zippoparea, by="Zipcode")
accidents <- accidents_population

Remove unneeded variables

Here we remove variables that we do not expect to be useful in determining the severity of the accidents.

accidents$ID <- NULL #unique for every accident
accidents$Source <- NULL #not meaningful in accident severity
accidents$TMC <- NULL #not meaningful in accident severity
accidents$Start_Lat <- NULL #using population and area instead
accidents$Start_Lng <- NULL #using population and area instead
accidents$End_Lat <- NULL #all null
accidents$End_Lng <- NULL #all null
accidents$Description <- NULL #unique for every accident
accidents$Number <- NULL #using population and area instead
accidents$Street <- NULL #using population and area instead
accidents$City <- NULL #using population and area instead
accidents$County <- NULL #using population and area instead
accidents$State <- NULL #using population and area instead
accidents$Zipcode <- NULL #using population and area instead
accidents$zipcode <- NULL #using population and area instead
accidents$Country <- NULL #already know all accidents occurred in USA
accidents$Airport_Code <- NULL #not meaningful in accident severity
accidents$Weather_Timestamp <- NULL #not meaningful in accident severity
accidents$Wind_Direction <- NULL #too much variation
accidents$Weather_Condition <- NULL #kept other columns for weather instead
accidents$Crossing <- NULL #kept junction and railway columns instead
accidents$Turning_Loop <- NULL #all false
accidents$Civil_Twilight <- NULL #kept sunrise/sunset instead
accidents$Nautical_Twilight <- NULL #kept sunrise/sunset instead
accidents$Astronomical_Twilight <- NULL #kept sunrise/sunset instead
accidents$Timezone <- NULL #using population and area instead
accidents$Start_Time <- NULL #not meaningful for accident severity
accidents$End_Time <- NULL #not meaningful for accident severity

Define NA values

Here we fill in the NA values of the remaining variables in the accidents data.

accidents$Severity[is.na(accidents$Severity)] = "No label"
accidents$Distance.mi.[is.na(accidents$Distance.mi.)] = 0
accidents$Side[is.na(accidents$Side)] = "Neither"
accidents$Temperature.F. <- na.approx(accidents$Temperature.F., rule = 2) #na.approx returns the full column with only the NAs filled
accidents$Wind_Chill.F. <- na.approx(accidents$Wind_Chill.F., rule = 2)
accidents$Humidity...[is.na(accidents$Humidity...)] = 0
accidents$Pressure.in. <- na.approx(accidents$Pressure.in., rule = 2)
accidents$Visibility.mi.[is.na(accidents$Visibility.mi.)] = 0
accidents$Wind_Speed.mph.[is.na(accidents$Wind_Speed.mph.)] = 0
accidents$Precipitation.in.[is.na(accidents$Precipitation.in.)] = 0
accidents$Sunrise_Sunset[is.na(accidents$Sunrise_Sunset)] = "Unknown"

Collapse object data

As mentioned earlier, we add a column that counts how many of the true/false variables are true for each accident. The individual true/false variables are then removed.

accidents$Amenity = as.logical(accidents$Amenity)
accidents$Bump = as.logical(accidents$Bump)
accidents$Give_Way = as.logical(accidents$Give_Way)
accidents$Junction = as.logical(accidents$Junction)
accidents$No_Exit = as.logical(accidents$No_Exit)
accidents$Railway = as.logical(accidents$Railway)
accidents$Roundabout = as.logical(accidents$Roundabout)
accidents$Station = as.logical(accidents$Station)
accidents$Stop = as.logical(accidents$Stop)
accidents$Traffic_Calming= as.logical(accidents$Traffic_Calming)
accidents$Traffic_Signal = as.logical(accidents$Traffic_Signal)
temp <- accidents[, 11:21] #columns 11 through 21 hold the eleven remaining logical variables
accidents$num_objects <- rowSums(temp) #count of TRUE values per accident
accidents$Amenity <- NULL
accidents$Bump <- NULL
accidents$Give_Way <- NULL
accidents$Junction <- NULL
accidents$No_Exit <- NULL
accidents$Railway <- NULL
accidents$Roundabout <- NULL
accidents$Station <- NULL
accidents$Stop <- NULL
accidents$Traffic_Calming <- NULL
accidents$Traffic_Signal <- NULL

Create data partition for RF

This next chunk creates a data partition to be used for a random forest model. This model will help determine which variables have the most impact on the severity assigned to each accident. The data is first reduced to a random sample of 50,000 observations so that training finishes in a reasonable amount of time.

accidents$Severity <- as.factor(accidents$Severity)
set.seed(12345) #seed before sampling so both the sample and the partition are reproducible
accidents2 <- sample(1:nrow(accidents), 50000, replace = FALSE)
accidents2 <- accidents[accidents2, ]
trainingIndices <- createDataPartition(accidents2$Severity, p = 0.7, list = FALSE)
training <- accidents2[trainingIndices, ]
testing <- accidents2[-trainingIndices, ]

Run Random Forest model for severity

Now we run the random forest model on the training data previously partitioned. It returns the accuracy and kappa of the model along with a table of how important each variable is in predicting Severity. The accuracy of the model is 70.56% and the kappa is 27.76%, so this is not by itself a very good model, but the importance table tells us which variables matter most: area and people stand out, followed by Wind_Chill.F., Pressure.in., Temperature.F., Humidity..., Distance.mi., and Wind_Speed.mph.

RF <- train(Severity~., data = training,
            method="rf",
            trControl=trainControl(method="CV", 10),
            preProcess=c("center", "scale"))
predicted <- predict(RF, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1    0    0    0    0
##          2    3 8849 2767  288
##          3    0 1181 1660   83
##          4    0   62   31   74
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7056          
##                  95% CI : (0.6983, 0.7129)
##     No Information Rate : 0.6729          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2776          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.0000   0.8768   0.3724 0.166292
## Specificity            1.0000   0.3767   0.8801 0.993610
## Pos Pred Value            NaN   0.7432   0.5677 0.443114
## Neg Pred Value         0.9998   0.5979   0.7683 0.974985
## Prevalence             0.0002   0.6729   0.2972 0.029671
## Detection Rate         0.0000   0.5900   0.1107 0.004934
## Detection Prevalence   0.0000   0.7939   0.1950 0.011135
## Balanced Accuracy      0.5000   0.6268   0.6262 0.579951
varImp(RF$finalModel)
##                       Overall
## Distance.mi.        1325.4338
## SideL                403.9136
## SideR                426.5110
## Temperature.F.      1652.2334
## Wind_Chill.F.       1826.0888
## Humidity...         1574.6389
## Pressure.in.        1785.5545
## Visibility.mi.       446.4112
## Wind_Speed.mph.     1136.5105
## Precipitation.in.    175.3047
## Sunrise_SunsetDay    144.0949
## Sunrise_SunsetNight  137.6184
## people              2176.6793
## area                2315.9165
## num_objects          507.7746

Build a linear regression model with the most important variables

As a first attempt, we fit a linear regression of Severity on the variables flagged as important above. The fit turns out to be poor: the R-squared is below 0.5%, so these variables explain very little of the variation in Severity.

accidents$Severity <- as.numeric(accidents$Severity)
LinearModel <- lm(as.numeric(Severity)~people+area+Pressure.in.+Humidity...+Temperature.F.+Wind_Speed.mph., 
            data = training)
summary(LinearModel)
## 
## Call:
## lm(formula = as.numeric(Severity) ~ people + area + Pressure.in. + 
##     Humidity... + Temperature.F. + Wind_Speed.mph., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4152 -0.3693 -0.3333  0.6177  1.7735 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.061e+00  1.196e-01  17.238  < 2e-16 ***
## people          -1.462e-06  1.469e-07  -9.953  < 2e-16 ***
## area            -1.463e-05  2.531e-05  -0.578 0.563279    
## Pressure.in.     1.139e-02  3.963e-03   2.874 0.004061 ** 
## Humidity...      4.680e-04  1.246e-04   3.755 0.000174 ***
## Temperature.F.  -5.820e-04  1.625e-04  -3.582 0.000341 ***
## Wind_Speed.mph.  1.562e-03  5.418e-04   2.884 0.003933 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5367 on 34995 degrees of freedom
## Multiple R-squared:  0.004742,   Adjusted R-squared:  0.004572 
## F-statistic: 27.79 on 6 and 34995 DF,  p-value: < 2.2e-16
exp(coef(LinearModel))
##     (Intercept)          people            area    Pressure.in.     Humidity... 
##       7.8526958       0.9999985       0.9999854       1.0114533       1.0004681 
##  Temperature.F. Wind_Speed.mph. 
##       0.9994181       1.0015636

Plot people versus area

Before we make an ordinal regression model, it may be helpful to know whether population and area are correlated. We make a scatterplot to find out. There is no strong relationship between the two variables, only a weak negative correlation: the number of people in a zip code tends to go down slightly as its area goes up.

plot(accidents$people, accidents$area)

cor(accidents$people, accidents$area)
## [1] -0.1010014
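
If we wanted to check whether this weak correlation is statistically distinguishable from zero, cor.test would do it:

cor.test(accidents$people, accidents$area) #tests the null hypothesis of zero correlation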

Create data partition for ordinal regression

Now we want to run an ordinal regression model with the most important variables according to the random forest model. First we partition the data into training and testing sets.

accidents$Severity <- as.factor(accidents$Severity)
set.seed(12345)
trainingIndices <- createDataPartition(accidents$Severity, p = 0.7, list = FALSE)
training <- accidents[trainingIndices, ]
testing <- accidents[-trainingIndices, ]

Run an ordinal regression model for severity

Now we run the ordinal regression model on the training data previously partitioned. It returns the accuracy and kappa of the model. The accuracy is 67.29% and the kappa is 0. A kappa of zero means the model does no better than chance agreement; indeed, the confusion matrix shows that it predicts severity 2 for nearly every accident, so its accuracy simply matches the no-information rate. This is not a good model.
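
For reference, kappa compares observed accuracy with the accuracy expected by chance; a minimal sketch of the computation that confusionMatrix performs:

cohens_kappa <- function(cm) { #cm: confusion matrix, rows = predicted, columns = reference
  cm <- as.matrix(cm)
  n <- sum(cm)
  p_observed <- sum(diag(cm))/n #accuracy
  p_expected <- sum(rowSums(cm) * colSums(cm))/n^2 #accuracy expected by chance
  (p_observed - p_expected)/(1 - p_expected)
}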

memory.size(max = T)
## [1] 6519.19
library(MASS)
ORD <- polr(Severity~people+area+Pressure.in.+Humidity...+Temperature.F.+Wind_Speed.mph., 
            data = training
            )
ORD
## Call:
## polr(formula = Severity ~ people + area + Pressure.in. + Humidity... + 
##     Temperature.F. + Wind_Speed.mph., data = training)
## 
## Coefficients:
##          people            area    Pressure.in.     Humidity...  Temperature.F. 
##   -3.420683e-06   -2.069004e-04    6.054184e-02    6.478983e-04   -2.058698e-03 
## Wind_Speed.mph. 
##    5.638299e-03 
## 
## Intercepts:
##       1|2       2|3       3|4 
## -6.413312  2.362706  5.098536 
## 
## Residual Deviance: 2930461.77 
## AIC: 2930479.77
predicted <- predict(ORD, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      1      2      3      4
##          1      0      0      0      0
##          2    273 574820 253002  26143
##          3      0      1      2      0
##          4      0      4      0      0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6729          
##                  95% CI : (0.6719, 0.6739)
##     No Information Rate : 0.6729          
##     P-Value [Acc > NIR] : 0.5033          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 1  Class: 2  Class: 3  Class: 4
## Sensitivity          0.0000000 1.000e+00 7.905e-06 0.000e+00
## Specificity          1.0000000 7.158e-06 1.000e+00 1.000e+00
## Pos Pred Value             NaN 6.729e-01 6.667e-01 0.000e+00
## Neg Pred Value       0.9996804 2.857e-01 7.038e-01 9.694e-01
## Prevalence           0.0003196 6.729e-01 2.962e-01 3.060e-02
## Detection Rate       0.0000000 6.729e-01 2.341e-06 0.000e+00
## Detection Prevalence 0.0000000 1.000e+00 3.512e-06 4.682e-06
## Balanced Accuracy    0.5000000 5.000e-01 5.000e-01 5.000e-01
exp(coef(ORD))
##          people            area    Pressure.in.     Humidity...  Temperature.F. 
##       0.9999966       0.9997931       1.0624121       1.0006481       0.9979434 
## Wind_Speed.mph. 
##       1.0056542

Split severity into two binary terms and run a logistic regression model

Because the lowest and highest values for severity are rare compared to the two middle values, we collapse this variable into a single binary term (low severity, 1 or 2, versus high severity, 3 or 4) and fit a logistic regression model. This is the best model in this study. Most of the predictors have p-values close to zero, though area and Humidity... are not significant.

accidents <- cbind(accidents[1:2], sapply(levels(accidents$Severity), function(x) as.integer(x == accidents$Severity)), accidents[3:14]) #one indicator column per severity level
accidents$LowSeverity <- accidents$`1`+accidents$`2` #severity 1 or 2
accidents$HighSeverity <- accidents$`3`+accidents$`4` #severity 3 or 4
accidents$`1` <- NULL
accidents$`2` <- NULL
accidents$`3` <- NULL
accidents$`4` <- NULL
accidents$HighLowSeverity <- accidents$HighSeverity
accidents$LowSeverity <- NULL
accidents$HighSeverity <- NULL
accidents$HighLowSeverity <- as.factor(accidents$HighLowSeverity)
set.seed(12345) #seed before sampling for reproducibility
accidents2 <- sample(1:nrow(accidents), 50000, replace = FALSE)
accidents2 <- accidents[accidents2, ]
trainingIndices <- createDataPartition(accidents2$Severity, p = 0.7, list = FALSE)
training <- accidents2[trainingIndices, ]
testing <- accidents2[-trainingIndices, ]
Logit <- glm(HighLowSeverity~people+area+Pressure.in.+Humidity...+Temperature.F.+Wind_Speed.mph., 
            data = training,
            family = binomial)
summary(Logit)
## 
## Call:
## glm(formula = HighLowSeverity ~ people + area + Pressure.in. + 
##     Humidity... + Temperature.F. + Wind_Speed.mph., family = binomial, 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0190  -0.8976  -0.8726   1.4707   1.8638  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.253e+00  5.193e-01  -6.264 3.76e-10 ***
## people          -1.816e-06  5.965e-07  -3.044 0.002337 ** 
## area            -8.940e-05  9.860e-05  -0.907 0.364560    
## Pressure.in.     8.778e-02  1.718e-02   5.109 3.24e-07 ***
## Humidity...      2.336e-04  5.001e-04   0.467 0.640411    
## Temperature.F.  -1.589e-03  6.377e-04  -2.492 0.012693 *  
## Wind_Speed.mph.  7.921e-03  2.148e-03   3.687 0.000227 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 44209  on 35001  degrees of freedom
## Residual deviance: 44147  on 34995  degrees of freedom
## AIC: 44161
## 
## Number of Fisher Scoring iterations: 4
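
As an aside, the indicator-column construction above can be written more directly; an equivalent sketch:

accidents$HighLowSeverity <- factor(as.integer(accidents$Severity %in% c("3", "4"))) #1 for severity 3 or 4, 0 otherwise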

Create data partition for GBM

Seeing that the ordinal regression model was not very good, we next use a gradient boosting machine model to decide which variables are most important. First we partition the data into training and testing sets. The data is reduced to 10,000 observations so the model can run in a reasonable amount of time.

accidents$Severity <- as.factor(accidents$Severity)
set.seed(12345) #seed before sampling for reproducibility
accidents2 <- sample(1:nrow(accidents), 10000, replace = FALSE)
accidents2 <- accidents[accidents2, ]
trainingIndices <- createDataPartition(accidents2$Severity, p = 0.7, list = FALSE)
## Warning in createDataPartition(accidents2$Severity, p = 0.7, list = FALSE): Some
## classes have a single record ( 1 ) and these will be selected for the sample
training <- accidents2[trainingIndices, ]
testing <- accidents2[-trainingIndices, ]

Run gradient boosting machine model for severity

Now we run the gradient boosting machine model on the training data previously partitioned and obtain the accuracy, kappa, and a table of the relative influence of each variable. The reported accuracy is 97.53% and the kappa is 94.56%, but these numbers should not be taken at face value: the relative influence table shows that HighLowSeverity, the column we derived directly from Severity in the previous section, accounts for 94% of the model's influence, so the model is largely reading the answer off a transformed copy of the response. Among the legitimate predictors, Distance.mi., area, and people are the most influential.

GBM <- train(Severity~., data=training,
             method="gbm",
             trControl=trainControl(method="CV", 10),
             preProcess=c("center", "scale"))
predicted <- predict(GBM, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1    0    0    0    0
##          2    0 2021    0    0
##          3    0    0  869   54
##          4    0    0   20   34
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9753          
##                  95% CI : (0.9691, 0.9806)
##     No Information Rate : 0.6741          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9456          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity                NA   1.0000   0.9775  0.38636
## Specificity                 1   1.0000   0.9744  0.99313
## Pos Pred Value             NA   1.0000   0.9415  0.62963
## Neg Pred Value             NA   1.0000   0.9904  0.98166
## Prevalence                  0   0.6741   0.2965  0.02935
## Detection Rate              0   0.6741   0.2899  0.01134
## Detection Prevalence        0   0.6741   0.3079  0.01801
## Balanced Accuracy          NA   1.0000   0.9759  0.68975
summary(GBM)

##                                     var     rel.inf
## HighLowSeverity1       HighLowSeverity1 94.00852390
## Distance.mi.               Distance.mi.  3.34473475
## area                               area  1.31649455
## people                           people  0.54168596
## Wind_Speed.mph.         Wind_Speed.mph.  0.15571432
## Wind_Chill.F.             Wind_Chill.F.  0.11190618
## Temperature.F.           Temperature.F.  0.11167327
## Pressure.in.               Pressure.in.  0.10260848
## SideL                             SideL  0.07999551
## Humidity...                 Humidity...  0.07958974
## SideR                             SideR  0.05721985
## num_objects                 num_objects  0.03560764
## Precipitation.in.     Precipitation.in.  0.03143756
## Visibility.mi.           Visibility.mi.  0.02280830
## Sunrise_SunsetDay     Sunrise_SunsetDay  0.00000000
## Sunrise_SunsetNight Sunrise_SunsetNight  0.00000000

Create data partition for ordinal regression

Now we partition the data to run another ordinal regression model.

accidents$Severity <- as.factor(accidents$Severity)
set.seed(12345)
trainingIndices <- createDataPartition(accidents$Severity, p = 0.7, list = FALSE)
training <- accidents[trainingIndices, ]
testing <- accidents[-trainingIndices, ]

Run an ordinal regression model for severity

When we run the ordinal regression model with the variables the gradient boosting machine ranked highly, we see that the accuracy is 67.60% and the kappa is 3.18%. This is slightly better than the first ordinal regression model, but still not very good.

memory.size(max = T)
## [1] 6519.19
library(MASS)
ORD <- polr(Severity~Distance.mi.+area+people+Side+num_objects+Pressure.in.+Wind_Chill.F., 
            data = training
            )
ORD
## Call:
## polr(formula = Severity ~ Distance.mi. + area + people + Side + 
##     num_objects + Pressure.in. + Wind_Chill.F., data = training)
## 
## Coefficients:
##  Distance.mi.          area        people         SideL         SideR 
##  1.831844e-01 -5.784946e-04 -3.266840e-06  4.965012e-01  2.113288e+00 
##   num_objects  Pressure.in. Wind_Chill.F. 
## -7.199411e-01  7.258936e-02 -3.313287e-03 
## 
## Intercepts:
##       1|2       2|3       3|4 
## -4.956150  4.352426  7.230947 
## 
## Residual Deviance: 2733741.13 
## AIC: 2733763.13
predicted <- predict(ORD, newdata = testing)
confusionMatrix(predicted, testing$Severity)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      1      2      3      4
##          1      0      0      0      0
##          2    273 570182 245088  22470
##          3      0   4491   7132   3491
##          4      0    152    784    182
## 
## Overall Statistics
##                                         
##                Accuracy : 0.676         
##                  95% CI : (0.675, 0.677)
##     No Information Rate : 0.6729        
##     P-Value [Acc > NIR] : 3.555e-10     
##                                         
##                   Kappa : 0.0318        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3  Class: 4
## Sensitivity          0.0000000  0.99192 0.028189 0.0069617
## Specificity          1.0000000  0.04148 0.986724 0.9988697
## Pos Pred Value             NaN  0.68040 0.471880 0.1627907
## Neg Pred Value       0.9996804  0.71396 0.706992 0.9695696
## Prevalence           0.0003196  0.67290 0.296173 0.0306036
## Detection Rate       0.0000000  0.66747 0.008349 0.0002131
## Detection Prevalence 0.0000000  0.98100 0.017693 0.0013088
## Balanced Accuracy    0.5000000  0.51670 0.507457 0.5029157

The conclusion

To conclude this project, we would need more information to determine which particular factors of car accidents are associated with traffic interference. As a follow-up study, we could revisit how some of the variables in this data set are measured. For instance, Distance and Visibility are recorded in miles, which seems too coarse for these variables: the road space taken up by a car accident is usually far less than one mile, and drivers rarely see more than a mile ahead of them. Another thing to look at is how to better handle the NA values in this data set. The Precipitation variable had many NA values that were assumed to be zero in this study; it might be better to look into some of those accidents and find out whether there really was precipitation, and if so, approximate the amount instead of assuming zero. We may also need to consider other potential factors of car accidents that lead to traffic interference, and other methods of choosing which factors are most important. This study used random forest and gradient boosting machine models to rank the factors; a follow-up study could try lasso or MARS for the same purpose, as sketched below.
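
As an illustration of that proposed follow-up, a lasso model could be fit with the glmnet package; this is only a sketch, assuming glmnet is installed, and was not run in this study:

library(glmnet)
train2 <- training
train2$HighLowSeverity <- NULL #derived directly from Severity, so it must be dropped first
x <- model.matrix(Severity ~ . - 1, data = train2) #numeric predictor matrix
y <- train2$Severity
cv_fit <- cv.glmnet(x, y, family = "multinomial", alpha = 1) #alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min") #predictors with nonzero coefficients are the ones retained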

Link to accidents data set: https://www.kaggle.com/sobhanmoosavi/us-accidents