S2155520 || Hossein Golmohammadi
S2108177 || Rahman karimiyazdi
S2162214 || Yang Wang
S2150932 || Amirah Nur Binti Azman
S2163882 || Yaping Wang
Driving is the most dangerous thing most Americans do every day. Virtually every American knows someone who’s been injured in a car crash, and each year cars kill about as many people as guns and severely injure millions. Cars killed 42,060 people in 2020, up from 39,107 in 2019, according to a preliminary estimate from the National Safety Council (NSC). An accident dataset was analyzed in order to provide a better understanding of this problem.
This project we will analyze it mainly using R to determine the main causes of accidents, which is a classification task to achieve road traffic accident classification. In general, the project has two objectives:
Severity of Traffic Accidents Data Analysis
To predict Severity traffic accidents
First, the data was uploaded by the read.csv command in order to access the data. This dataset has 12,316 records or (rows) and 15 features (columns).
data <- read.csv("traffic_Accident.csv",header =T,na.strings=c(""))
head(data,3)
#> Age_Categories Sex_of_driver Educational_level Vehicle_driver_relation
#> 1 18-30 Male Above high school Employee
#> 2 31-50 Male Junior high school Employee
#> 3 18-30 Male Junior high school Employee
#> Driving_experience Lanes_or_Medians Types_of_Junction Road_surface_type
#> 1 1-2yr <NA> No junction Asphalt roads
#> 2 Above 10yr Undivided Two way No junction Asphalt roads
#> 3 1-2yr other No junction Asphalt roads
#> Light_conditions Weather_conditions Type_of_collision
#> 1 Daylight Normal Collision with roadside-parked vehicles
#> 2 Daylight Normal Vehicle with vehicle collision
#> 3 Daylight Normal Collision with roadside objects
#> Vehicle_movement Pedestrian_movement Cause_of_accident
#> 1 Going straight Not a Pedestrian Moving Backward
#> 2 Going straight Not a Pedestrian Overtaking
#> 3 Going straight Not a Pedestrian Changing lane to the left
#> Accident_severity
#> 1 NA
#> 2 NA
#> 3 1
dim(data)
#> [1] 12316 15
Summary is a generic function used to produce result summaries, including the number of records, unique categories, mean, standard deviation, min, max and quartiles of the dataset.
summary(data)
#> Age_Categories Sex_of_driver Educational_level
#> Length:12316 Length:12316 Length:12316
#> Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character
#>
#>
#>
#>
#> Vehicle_driver_relation Driving_experience Lanes_or_Medians
#> Length:12316 Length:12316 Length:12316
#> Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character
#>
#>
#>
#>
#> Types_of_Junction Road_surface_type Light_conditions Weather_conditions
#> Length:12316 Length:12316 Length:12316 Length:12316
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#>
#> Type_of_collision Vehicle_movement Pedestrian_movement Cause_of_accident
#> Length:12316 Length:12316 Length:12316 Length:12316
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#>
#> Accident_severity
#> Min. :0.000
#> 1st Qu.:2.000
#> Median :2.000
#> Mean :1.833
#> 3rd Qu.:2.000
#> Max. :2.000
#> NA's :13
The is.na function shows the number of nulls per column.
colSums(is.na(data)==TRUE|data=='')
#> Age_Categories Sex_of_driver Educational_level
#> 1548 178 841
#> Vehicle_driver_relation Driving_experience Lanes_or_Medians
#> 593 862 442
#> Types_of_Junction Road_surface_type Light_conditions
#> 1078 172 0
#> Weather_conditions Type_of_collision Vehicle_movement
#> 292 169 396
#> Pedestrian_movement Cause_of_accident Accident_severity
#> 0 25 13
There are a total of 6582 Null values in the dataset. Based on the line graph below, Age_Categories has the highest Null values of more than 1500 records. While Accident_severity has the lowest Null values with only 13 records.
gg_miss_var(data)
#> Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
#> ggplot2 3.3.4.
#> ℹ Please use "none" instead.
#> ℹ The deprecated feature was likely used in the naniar package.
#> Please report the issue at <]8;;https://github.com/njtierney/naniar/issueshttps://github.com/njtierney/naniar/issues]8;;>.As the graph below shows, Black color cells indicate percentage of Null values and gray color cells indicate percentage of Non Null values. Age_Categories has the highest percentage of Null values with 12.57% and the lowest percentage of Null values is 0.09%. Overall, this dataset has 3.6% of Null values compared to the total size of the dataset.
vis_miss(data)
#> Warning: `gather_()` was deprecated in tidyr 1.2.0.
#> ℹ Please use `gather()` instead.
#> ℹ The deprecated feature was likely used in the visdat package.
#> Please report the issue at <]8;;https://github.com/ropensci/visdat/issueshttps://github.com/ropensci/visdat/issues]8;;>.The missing data can create problems for analyzing data. Replacing missing data with substituted values is seen as a way to avoid pitfalls involved with listwise deletion of cases that have missing values. In the “Accident_Severity” column. The null values were replaced by the mean of the feature.
data$Accident_severity[is.na(data$Accident_severity)==TRUE] = round(mean(data$Accident_severity, na.rm = TRUE))The null values in the four other columns including “Age_Categories”,
“Educational_level”, “Driving_experience”, “Lanes_or_Medians”,
“Educational_level”, “Driving_experience”, “Lanes_or_Medians” were
replaced with the mode of the feature as shown in below
respectively.
getmode <- function(a) {
uniqage <- unique(a)
uniqage[which.max(tabulate(match(a, uniqage)))]
}
data$Age_Categories[is.na(data$Age_Categories)==TRUE]= getmode(data$Age_Categories)
print( getmode(data$Age_Categories))
#> [1] "18-30"
data$Educational_level[is.na(data$Educational_level)==TRUE]= getmode(data$Educational_level)
print( getmode(data$Educational_level))
#> [1] "Junior high school"
data$Driving_experience[is.na(data$Driving_experience)==TRUE] = getmode(data$Driving_experience)
print(getmode(data$Driving_experience))
#> [1] "5-10yr"
data$Lanes_or_Medians[is.na(data$Lanes_or_Medians)==TRUE] = getmode(data$Lanes_or_Medians)
print(getmode(data$Lanes_or_Medians))
#> [1] "Two-way (divided with broken lines road marking)"
The missing values in other columns including, “Road_surface_type”,
“Type_of_collision”, “Cause_of_accident”, “Types_of_Junction”,
“Vehicle_movement”, “Vehicle_driver_relation”, “Weather_conditions”,
“Sex_of_driver” were replaced by the “Unknown” keyword because of the
importance and high sensitivity of the columns’ values in the analysis
stage and obviously making a better decision based on them.
data$Road_surface_type[is.na(data$Road_surface_type)==TRUE]= "Unknown"
data$Type_of_collision[is.na(data$Type_of_collision)==TRUE]= "Unknown"
data$Cause_of_accident[is.na(data$Cause_of_accident)==TRUE]= "Unknown"
data$Types_of_Junction [is.na(data$Types_of_Junction )==TRUE]= "Unknown"
data$Vehicle_movement[is.na(data$Vehicle_movement)==TRUE]= "Unknown"
data$Vehicle_driver_relation[is.na(data$Vehicle_driver_relation)==TRUE]= "Unknown"
data$Weather_conditions[is.na(data$Weather_conditions)==TRUE]= "Unknown"
data$Sex_of_driver[is.na(data$Sex_of_driver)==TRUE]= "Unknown"
After imputation the dataset, the vis_miss function was run to recheck the null data situation. Fortunately, There was no missing data anymore.
vis_miss(data)
colSums(is.na(data)==TRUE|data=='')
#> Age_Categories Sex_of_driver Educational_level
#> 0 0 0
#> Vehicle_driver_relation Driving_experience Lanes_or_Medians
#> 0 0 0
#> Types_of_Junction Road_surface_type Light_conditions
#> 0 0 0
#> Weather_conditions Type_of_collision Vehicle_movement
#> 0 0 0
#> Pedestrian_movement Cause_of_accident Accident_severity
#> 0 0 0
The diagram below shows the number of accidents in every level of
severity and every level categorized by the number of collision types.
The highest level of severity accounts for most of the accidents with
vehicle collisions.
ggplot(data) +
geom_bar(aes(x = Accident_severity, fill = Type_of_collision)) +
xlab("Severity of Accidents") + ylab("Number of Accidents")
There are plenty of causes for an accident to take place. The highest
cause for an accident is “No Distance” with a total of 2,250 accidents.
This indicated that accidents happened due to no safe distance between
vehicles when on the road.
p_causes <- ggplot(data) +
geom_bar(aes(x = Cause_of_accident), position = "dodge", width = 0.2) +
xlab("Causes of Accidents") + ylab("Number of Accidents")
p_causes + coord_flip()The overwhelming majority of drivers are men in this dataset. There is a small part in the dataset with unknown sex.
ggplot() +
geom_bar(data,mapping=aes(x = "", y = "", fill = Sex_of_driver), width = 1, stat = "identity") +
coord_polar("y", start = 0)
In both female and male gender, the majority of accident occur in the
highest level of severity.
ggplot(data, aes(x=Accident_severity, fill=Sex_of_driver )) +
geom_bar(position = "dodge") +
labs(x="Severity of Accidents",y="Number of Accidents")
Accidents with the most severity (Class 2) have different age groups
of drivers. The highest number of accidents in level 2 belongs to
drivers aged between 18 to 30 years old. This group of age has more than
3500 number of accidents recorded. The second highest belongs to drivers
aged between 31 to 50 years old with 3500 number of accidents
recorded.
ggplot(data, aes(x=Accident_severity, fill=Age_Categories )) +
geom_bar(position = "dodge") +
labs(x="Severity of Accidents",y="Number of Accidents")Driving experiences of each driver on level 2 has been categorized
based on the total years driving experiences. Drivers with 5 to 10 years
of driving experience have the highest number of accidents in Class 2.
This category counted more than 2,500 accidents. While drivers with 2 to
5 years of driving experience in Class 2 have a record of slightly more
than 2,000 accidents.
ggplot(data, aes(x=Accident_severity, fill=Driving_experience )) +
geom_bar(position = "dodge") +
labs(x="Severity of Accidents",y="Number of Accidents")Accidents with the most severity in Class 2 occurred at various
junction on the road. Based on the stacked bar graph, accidents mostly
happened at the “Y” shape junction as well as at location with No
junction. Both locations of the road have equal total of accidents
recorded with about 2500 of accidents.
ggplot(data) +
geom_bar(aes(x = Accident_severity, fill = Types_of_Junction)) +
xlab("Severity of Accidents") + ylab("Number of Accidents")As the diagram shows, the vast majority of accidents occur in normal
weather. It seems weather conditions do not play an important role in
the occurrence of accidents.
p_weather <- ggplot(data) +
geom_bar(aes(x =Weather_conditions , fill = Weather_conditions), width = 0.5) +
xlab("Weather Condistions") + ylab("Number of Accidents")
p_weather + coord_flip()The number of accidents that happen on Asphalt roads are dramatically
higher than other types of road. It seems the type of roads do not play
an important role in the occurrence of accidents.
p_road <- ggplot(data) +
geom_bar(aes(x = Road_surface_type, fill=Road_surface_type), position = "dodge", width = 0.5) +
xlab("Type of Roads Surface") + ylab("Number of Accidents")
p_road + coord_flip()The majority of the accidents took place during the daylight in each
of the accident severity classes. However, conditions with total
darkness and no lighting are also considerable.
ggplot(data) + geom_mosaic(aes(x = product(Light_conditions, Accident_severity), fill = Light_conditions)) +
xlab("Severity of Accidents") + ylab("Light Conditions")
Machine learning is about learning to predict something or extracting
knowledge from data. ML algorithms build a model based on sample data or
known as training data and based upon the training data the algorithm
can predict something on new data. The central goal of this section is
to design a model that makes proper accident severity classifications
for new accidents. In the first step, it was essential to change the
“Accident_severity” type to factor in order the ML algorithm can be
run.
data$Accident_severity <- as.factor(data$Accident_severity)
str(data)
#> 'data.frame': 12316 obs. of 15 variables:
#> $ Age_Categories : chr "18-30" "31-50" "18-30" "18-30" ...
#> $ Sex_of_driver : chr "Male" "Male" "Male" "Male" ...
#> $ Educational_level : chr "Above high school" "Junior high school" "Junior high school" "Junior high school" ...
#> $ Vehicle_driver_relation: chr "Employee" "Employee" "Employee" "Employee" ...
#> $ Driving_experience : chr "1-2yr" "Above 10yr" "1-2yr" "5-10yr" ...
#> $ Lanes_or_Medians : chr "Two-way (divided with broken lines road marking)" "Undivided Two way" "other" "other" ...
#> $ Types_of_Junction : chr "No junction" "No junction" "No junction" "Y Shape" ...
#> $ Road_surface_type : chr "Asphalt roads" "Asphalt roads" "Asphalt roads" "Earth roads" ...
#> $ Light_conditions : chr "Daylight" "Daylight" "Daylight" "Darkness - lights lit" ...
#> $ Weather_conditions : chr "Normal" "Normal" "Normal" "Normal" ...
#> $ Type_of_collision : chr "Collision with roadside-parked vehicles" "Vehicle with vehicle collision" "Collision with roadside objects" "Vehicle with vehicle collision" ...
#> $ Vehicle_movement : chr "Going straight" "Going straight" "Going straight" "Going straight" ...
#> $ Pedestrian_movement : chr "Not a Pedestrian" "Not a Pedestrian" "Not a Pedestrian" "Not a Pedestrian" ...
#> $ Cause_of_accident : chr "Moving Backward" "Overtaking" "Changing lane to the left" "Changing lane to the right" ...
#> $ Accident_severity : Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...The first ML model is Random Forest. It builds and combines multiple
decision trees to get more accurate predictions. It’s a non-linear
classification algorithm. They are called random because they choose
predictors randomly at a time of training. They are called forest
because they take the output of multiple trees to make a decision.
Random forest outperforms decision trees as a large number of
uncorrelated trees(models) operating as a committee will always
outperform the individual constituent models. In the below the function
is created to run the Random Forest.
RF<-function(s, df, col) {
trainIndex<-createDataPartition(col, p=s, list=F)
data_train<-df[trainIndex,]
data_test<-df[-trainIndex,]
model <- randomForest(Accident_severity~., data=data_train)
# make predictions
x_test <- data_test[,1:14]
y_test <- data_test[,15]
predictions <- predict(model, x_test)
cm<-confusionMatrix(predictions, y_test)
return(cm)
}The data splitted to 70% and 30% for train and test respectively and
then the results are shown. The model has 84% Accuracy.
split<-0.70 # 70%/30% train/test
result4<-RF(split, data, data$Accident_severity)
result4
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1 2
#> 0 0 0 0
#> 1 0 0 5
#> 2 47 522 3120
#>
#> Overall Statistics
#>
#> Accuracy : 0.8446
#> 95% CI : (0.8325, 0.8562)
#> No Information Rate : 0.846
#> P-Value [Acc > NIR] : 0.6009
#>
#> Kappa : -0.0026
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 0 Class: 1 Class: 2
#> Sensitivity 0.00000 0.000000 0.9984
#> Specificity 1.00000 0.998424 0.0000
#> Pos Pred Value NaN 0.000000 0.8458
#> Neg Pred Value 0.98728 0.858498 0.0000
#> Prevalence 0.01272 0.141310 0.8460
#> Detection Rate 0.00000 0.000000 0.8446
#> Detection Prevalence 0.00000 0.001354 0.9986
#> Balanced Accuracy 0.50000 0.499212 0.4992Cross-validation is a resampling procedure used to evaluate machine
learning models on a limited data sample. The procedure has a single
parameter called k that refers to the number of groups that a given data
sample is to be split into. The data split it into 5 groups (k = 5).
mtry is the number of variables to randomly sample as candidates at each
split. As it shows the most accurate value for mtry was 2 with an
accuracy of 84%.
train_control <- trainControl(method="cv", number=5)
modelRF <- train(Accident_severity~., data=data, trControl=train_control, method="rf")
modelRF
#> Random Forest
#>
#> 12316 samples
#> 14 predictor
#> 3 classes: '0', '1', '2'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 9852, 9852, 9853, 9854, 9853
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8458104 0.000000000
#> 48 0.8304642 0.002228018
#> 94 0.8248619 0.005938231
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.k-Nearest Neighbor(k-NN) is an instance-based supervised learning
algorithm which classifies a new instance by comparing it with already
stored instances in the memory that have already been seen in training.
In the below the function is created to run the Random Forest.
KNN<-function(s, df, col) {
trainIndex<-createDataPartition(col, p=s, list=F)
data_train<-df[trainIndex,]
data_test<-df[-trainIndex,]
model <- train(Accident_severity~., data=data_train, method = "knn")
# make predictions
x_test <- data_test[,1:14]
y_test <- data_test[,15]
predictions <- predict(model, x_test)
cm<-confusionMatrix(predictions, y_test)
return(list(model, cm))
}The data splitted to 70% and 30% for train and test respectively and
then the results are shown. The model has 83.68% Accuracy when the K =
9.
split<-0.70 # 70%/30% train/test
result6<-KNN(split, data, data$Accident_severity)
result6
#> [[1]]
#> k-Nearest Neighbors
#>
#> 8622 samples
#> 14 predictor
#> 3 classes: '0', '1', '2'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 8622, 8622, 8622, 8622, 8622, 8622, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 5 0.8221499 0.006440964
#> 7 0.8322765 0.001662331
#> 9 0.8376776 0.002163838
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 9.
#>
#> [[2]]
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1 2
#> 0 0 0 0
#> 1 0 0 8
#> 2 47 522 3117
#>
#> Overall Statistics
#>
#> Accuracy : 0.8438
#> 95% CI : (0.8317, 0.8554)
#> No Information Rate : 0.846
#> P-Value [Acc > NIR] : 0.6524
#>
#> Kappa : -0.0041
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 0 Class: 1 Class: 2
#> Sensitivity 0.00000 0.000000 0.9974
#> Specificity 1.00000 0.997478 0.0000
#> Pos Pred Value NaN 0.000000 0.8456
#> Neg Pred Value 0.98728 0.858383 0.0000
#> Prevalence 0.01272 0.141310 0.8460
#> Detection Rate 0.00000 0.000000 0.8438
#> Detection Prevalence 0.00000 0.002166 0.9978
#> Balanced Accuracy 0.50000 0.498739 0.4987The data split it into 5 groups (k = 5). As it shows the most
accurate value for mtry was 9 with an accuracy of 84.5%.
train_control <- trainControl(method="cv", number=5)
modelknn <- train(Accident_severity~., data=data, trControl=train_control, method="knn")
modelknn
#> k-Nearest Neighbors
#>
#> 12316 samples
#> 14 predictor
#> 3 classes: '0', '1', '2'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 9853, 9852, 9853, 9854, 9852
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 5 0.8410197 0.0018354264
#> 7 0.8449983 0.0050478840
#> 9 0.8453231 0.0005437254
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 9.
As exploratory data analysis shows, the main factors that cause accidents include no distancing and sudden change lanes to the right and left. Lack of light also plays a vital role in accidents. Weather conditions and type of roads have a low influence on accident occurrence. Therefore, human error is the leading cause of accidents. By comparing the prediction results of random forest and the KNN algorithm, it is concluded that, in this case, the KNN algorithm with k-fold = 5 has higher accuracy for traffic accident severity prediction.