The dataset used in this analysis, titled Australian Fatal Road Accidents (1989–2021), contains comprehensive data on road crashes across Australia. In this case, the main goal is to determine patterns of crashes, the part played by various user classes and prime factors of accidents within the given time span. To achieve these goals our analysis is limited to certain variables which includes crash type, the road user involved, their ages, as well as the period in the day they incurred the accident. The remedy will involve data cleaning, data exploration, hypothesis checking via inferential statistics, predictive modelling, logistic regression and random forest to make the inferences on the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(shiny)
library(ggplot2)
library(dplyr)
data <- read_excel("D:/Crash_Data.xlsx")
head(data)
## # A tibble: 6 × 23
## `Crash ID` State Month Year Dayweek Time `Crash Type`
## <dbl> <chr> <dbl> <dbl> <chr> <dttm> <chr>
## 1 20212133 Vic 9 2021 Sunday 1899-12-31 00:30:00 Single
## 2 20214022 SA 9 2021 Saturday 1899-12-31 23:31:00 Multiple
## 3 20212096 Vic 9 2021 Saturday 1899-12-31 23:00:00 Single
## 4 20212145 Vic 9 2021 Saturday 1899-12-31 22:25:00 Single
## 5 20212075 Vic 9 2021 Saturday 1899-12-31 05:15:00 Single
## 6 20213034 Qld 9 2021 Saturday 1899-12-31 04:00:00 Multiple
## # ℹ 16 more variables: `Bus Involvement` <chr>,
## # `Heavy Rigid Truck Involvement` <chr>,
## # `Articulated Truck Involvement` <chr>, `Speed Limit` <dbl>,
## # `Road User` <chr>, Gender <chr>, Age <dbl>,
## # `National Remoteness Areas` <chr>, `SA4 Name 2016` <chr>,
## # `National LGA Name 2017` <chr>, `National Road Type` <chr>,
## # `Christmas Period` <chr>, `Easter Period` <chr>, `Age Group` <chr>, …
str(data)
## tibble [52,843 × 23] (S3: tbl_df/tbl/data.frame)
## $ Crash ID : num [1:52843] 20212133 20214022 20212096 20212145 20212075 ...
## $ State : chr [1:52843] "Vic" "SA" "Vic" "Vic" ...
## $ Month : num [1:52843] 9 9 9 9 9 9 9 9 9 9 ...
## $ Year : num [1:52843] 2021 2021 2021 2021 2021 ...
## $ Dayweek : chr [1:52843] "Sunday" "Saturday" "Saturday" "Saturday" ...
## $ Time : POSIXct[1:52843], format: "1899-12-31 00:30:00" "1899-12-31 23:31:00" ...
## $ Crash Type : chr [1:52843] "Single" "Multiple" "Single" "Single" ...
## $ Bus Involvement : chr [1:52843] NA "No" NA NA ...
## $ Heavy Rigid Truck Involvement: chr [1:52843] NA "No" NA NA ...
## $ Articulated Truck Involvement: chr [1:52843] NA "No" NA NA ...
## $ Speed Limit : num [1:52843] NA 110 NA NA NA 100 100 NA NA 60 ...
## $ Road User : chr [1:52843] "Motorcycle rider" "Pedestrian" "Passenger" "Driver" ...
## $ Gender : chr [1:52843] "Male" "Female" "Male" "Male" ...
## $ Age : num [1:52843] 38 28 19 23 46 19 20 17 2 47 ...
## $ National Remoteness Areas : chr [1:52843] "Inner Regional Australia" "Major Cities of Australia" "Inner Regional Australia" "Outer Regional Australia" ...
## $ SA4 Name 2016 : chr [1:52843] "Melbourne - Outer East" "Adelaide - North" "Hume" "Hume" ...
## $ National LGA Name 2017 : chr [1:52843] "Yarra Ranges (S)" "Playford (C)" "Wangaratta (RC)" "Wangaratta (RC)" ...
## $ National Road Type : chr [1:52843] "Arterial Road" NA "Access road" "Arterial Road" ...
## $ Christmas Period : chr [1:52843] "No" "No" "No" "No" ...
## $ Easter Period : chr [1:52843] "No" "No" "No" "No" ...
## $ Age Group : chr [1:52843] "26_to_39" "26_to_39" "17_to_25" "17_to_25" ...
## $ Day of week : chr [1:52843] "Weekend" "Weekend" "Weekend" "Weekend" ...
## $ Time of day : chr [1:52843] "Night" "Night" "Night" "Night" ...
missing_data <- colSums(is.na(data))
print(missing_data)
## Crash ID State
## 0 0
## Month Year
## 0 0
## Dayweek Time
## 0 40
## Crash Type Bus Involvement
## 0 22
## Heavy Rigid Truck Involvement Articulated Truck Involvement
## 20515 22
## Speed Limit Road User
## 709 0
## Gender Age
## 27 0
## National Remoteness Areas SA4 Name 2016
## 45965 45951
## National LGA Name 2017 National Road Type
## 45950 45966
## Christmas Period Easter Period
## 0 0
## Age Group Day of week
## 90 0
## Time of day
## 0
data_clean <- na.omit(data)
print("Summary statistics:")
## [1] "Summary statistics:"
summary(data)
## Crash ID State Month Year
## Min. :19891001 Length:52843 Min. : 1.000 Min. :1989
## 1st Qu.:19951106 Class :character 1st Qu.: 4.000 1st Qu.:1995
## Median :20021441 Mode :character Median : 7.000 Median :2002
## Mean :20030206 Mean : 6.569 Mean :2003
## 3rd Qu.:20104082 3rd Qu.:10.000 3rd Qu.:2010
## Max. :20218009 Max. :12.000 Max. :2021
##
## Dayweek Time Crash Type
## Length:52843 Min. :1899-12-31 00:00:00.00 Length:52843
## Class :character 1st Qu.:1899-12-31 08:15:00.00 Class :character
## Mode :character Median :1899-12-31 14:00:00.00 Mode :character
## Mean :1899-12-31 13:03:29.09
## 3rd Qu.:1899-12-31 18:00:00.00
## Max. :1899-12-31 23:59:00.00
## NA's :40
## Bus Involvement Heavy Rigid Truck Involvement Articulated Truck Involvement
## Length:52843 Length:52843 Length:52843
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Speed Limit Road User Gender Age
## Min. : -9.00 Length:52843 Length:52843 Min. : -9.00
## 1st Qu.: 60.00 Class :character Class :character 1st Qu.: 22.00
## Median : 80.00 Mode :character Mode :character Median : 34.00
## Mean : 81.86 Mean : 39.66
## 3rd Qu.:100.00 3rd Qu.: 55.00
## Max. :130.00 Max. :101.00
## NA's :709
## National Remoteness Areas SA4 Name 2016 National LGA Name 2017
## Length:52843 Length:52843 Length:52843
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## National Road Type Christmas Period Easter Period Age Group
## Length:52843 Length:52843 Length:52843 Length:52843
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Day of week Time of day
## Length:52843 Length:52843
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The first process of an analysis is data importing, which can be executed using the read_excel() function, this helps to review several rows in a dataset in order to determine its structure (Abd Aziz et al., 2022). For data quality check, we first scanned for the missing values in the data set by using the colSums(is.na(data)) and dealt with them by eliminating the rows with the missing values using the na.omit() function. This move was an essential step to avoid cases of mistakes when performing the analysis.
Also, all the names of columns were modified to reduce confusion by making them all appear in the same format. Some of the column names including Crash Type, Bus Involvement, and Road User were transformed to easily understandable formats known as “Crash_Type” and “Road_User”, respectively, so that it was uniform through the rows of the database. This standardization is important in order to obtain reliable calculations (Sangare et al., 2021).
Next we did a brief description of the dataset using the functions str() and summary(), which give an idea on the type of data stored in a variable, as well as the mean, median, quartiles among others, which give an idea of the distribution of a given data.
colnames(data)[colnames(data) == "Crash Type"] <- "Crash_Type"
colnames(data)[colnames(data) == "Bus Involvement"] <- "Bus_Involvement"
colnames(data)[colnames(data) == "Heavy Rigid Truck Involvement"] <- "Heavy_Rigid_Truck_Involvement"
colnames(data)[colnames(data) == "Articulated Truck Involvement"] <- "Articulated_Truck_Involvement"
colnames(data)[colnames(data) == "Road User"] <- "Road_User"
colnames(data)[colnames(data) == "Age Group"] <- "Age_Group"
colnames(data)[colnames(data) == "Day of week"] <- "Day_of_week"
colnames(data)[colnames(data) == "Time of day"] <- "Time_of_day"
unique(data$`Crash_Type`)
## [1] "Single" "Multiple"
data %>%
group_by(`Crash_Type`) %>%
summarise(count = n())
## # A tibble: 2 × 2
## Crash_Type count
## <chr> <int>
## 1 Multiple 23594
## 2 Single 29249
The exploratory analysis started with the creation of new crash type variables and road user variables, which were accomplished through the use of unique() function, and grouped by. A bar chart presented crash distributions categorizing vehicles, drivers, pedestrians, and cyclists, the drivers are more likely to be involved in single-car crashes while pedestrians and cyclists are more exposed to multi-vehicle crashes (Huang et al., 2022).
The line chart depicting the number of crashes gives an indication of the yearly pattern of the same from 1989 to 2021. There is a sudden increase in the crashes in the mid-2000s and a relatively decrease during the later years. This trend indicates that, despite the possibility that road safety measures have enhanced, there was a time that had high numbers of fatalities during the early times of the data set. A heatmap was taken for commonality between crash types and road users. It established that certain crash types, for instance, single vehicle, a specific type of crash commonly involving drivers while other types such as pedestrian crash are more associated with pedestrians.
Bar Chart
crash_type_counts <- data %>%
group_by(Crash_Type, Road_User) %>%
summarise(count = n(), .groups = 'drop')
ggplot(crash_type_counts, aes(x = reorder(Crash_Type, -count), y = count, fill = Road_User)) +
geom_bar(stat = "identity") +
labs(title = "Crash Type Count by Road User", x = "Crash Type", y = "Crash Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_minimal() +
coord_flip()
Line Chart
data$Year <- as.numeric(as.character(data$Year))
crash_rate_per_year <- data %>%
group_by(Year) %>%
summarise(crash_count = n(), .groups = 'drop')
ggplot(crash_rate_per_year, aes(x = Year, y = crash_count, group = 1)) +
geom_line(color = "blue") + # Adding a line in blue
labs(title = "Crash Count by Year", x = "Year", y = "Crash Count") +
theme_minimal()
Heat Map
crash_rate_data <- data %>%
group_by(Crash_Type, Road_User) %>%
summarise(crash_count = n(), .groups = 'drop')
ggplot(crash_rate_data, aes(x = Crash_Type, y = Road_User, fill = crash_count)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "red") +
labs(title = "Heatmap of Crash Rates by Type and User", x = "Crash Type", y = "Road User") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_minimal()
Pie Chart
crash_counts_by_user <- data %>%
group_by(Road_User) %>%
summarise(total_crashes = n(), .groups = 'drop')
ggplot(crash_counts_by_user, aes(x = "", y = total_crashes, fill = Road_User)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Proportion of Crashes by Road User") +
theme_void()
Last of all, a pie chart was developed to show the distribution of the crashes according to the road user. Crashes are dominated most by drivers but a reasonable quantity of them is caused by pedestrian, cyclists, and motorcyclists, hence there is need to enhance the safety for all the road users.
Box Plot
crash_count_by_type <- data %>%
group_by(Crash_Type) %>%
summarise(crash_count = n(), .groups = 'drop')
ggplot(crash_count_by_type, aes(x = Crash_Type, y = crash_count, fill = Crash_Type)) +
geom_boxplot() +
labs(title = "Distribution of Crash Count by Crash Type", x = "Crash Type", y = "Crash Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
A box plot was also created in order to present the distribution of the crash counts as per crash types. This map showed that some crash types, such as involving heavy vehicles, have higher variance in the crash occurrence, and very high numbers among them which may refer to more serious crashes.
T Test
colnames(data_clean)[colnames(data_clean) == "Road User"] <- "Road_User"
pedestrian_vs_driver <- data_clean %>%
filter(Road_User %in% c("Pedestrian", "Driver"))
t_test_result <- t.test(`Crash ID` ~ Road_User, data = pedestrian_vs_driver)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: Crash ID by Road_User
## t = 1.1476, df = 1517.3, p-value = 0.2513
## alternative hypothesis: true difference in means between group Driver and group Pedestrian is not equal to 0
## 95 percent confidence interval:
## -600.4636 2293.8229
## sample estimates:
## mean in group Driver mean in group Pedestrian
## 20181247 20180400
A t-test was conducted to compare crash counts between pedestrians and drivers. The test was meant to find out whether there were any significant differences in the number of accidents that involved these two road users (Nassiri, Mohammadpour & Dahaghin, 2023). They showed that there was a difference, calling for a statistic value to define whether two means, in this case, pedestrians and drivers, are equal where with a p-value, again showing that pedestrians are at a higher risk of getting involved in road accidents.
Anova
anova_result <- aov(`Crash ID` ~ Road_User, data = data_clean)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Road_User 6 6.191e+09 1.032e+09 2.569 0.0174 *
## Residuals 6777 2.722e+12 4.017e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An ANOVA test was performed to evaluate differences in crash counts across multiple road user categories. Further, concerning the keyed and unkeyed crashes, and changing the aspect of ‘state cause code,’ the crosstab approach confirmed there were differences in the number of crashes among cyclists, pedestrians, and drivers, whereby significance was determined by the p-value at least one group was significantly different to other groups. This indicates that crash occurrence is not unique and depends on the type of roadway user (Tao et al., 2022).
Chi Square Test
contingency_table <- table(data_clean$Road_User, data_clean$`Crash Type`)
chi_square_result <- chisq.test(contingency_table)
print(chi_square_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 544.09, df = 6, p-value < 2.2e-16
Finally, Chi-Square Test was conducted to test the relationship between the road users and the crash types. The results of the Chi-square test show that these distributions were not independent and the p-value pointed out the fact that certain specific crash types are more related to specific road users.
library(tidyverse)
data_clean$`Crash Type` <- factor(data_clean$`Crash Type`)
data_clean$`Day of week` <- factor(data_clean$`Day of week`)
logit_model <- glm(`Crash Type` ~ `Speed Limit` + Age + `Day of week`,
family = binomial(link = "logit"),
data = data_clean)
summary(logit_model)
##
## Call:
## glm(formula = `Crash Type` ~ `Speed Limit` + Age + `Day of week`,
## family = binomial(link = "logit"), data = data_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.631174 0.116316 5.426 5.75e-08 ***
## `Speed Limit` -0.002719 0.001126 -2.415 0.0158 *
## Age -0.006394 0.001131 -5.655 1.56e-08 ***
## `Day of week`Weekend 0.316247 0.051229 6.173 6.69e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9307.2 on 6783 degrees of freedom
## Residual deviance: 9221.6 on 6780 degrees of freedom
## AIC: 9229.6
##
## Number of Fisher Scoring iterations: 4
predicted_probs <- predict(logit_model, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, "Multiple", "Single")
confusion_matrix <- table(Predicted = predicted_class, Actual = data_clean$`Crash Type`)
print(confusion_matrix)
## Actual
## Predicted Multiple Single
## Multiple 2389 3242
## Single 597 556
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy: ", round(accuracy, 4)))
## [1] "Accuracy: 0.4341"
In the analysis, Logistic Regression Model was used for predicting crash types depending on factors like speed limit, age as well as day of the week, day of the year (Rahman et al., 2022). Finally, the result indicated in Table 7 depicted the confusion matrix that shows that the model has reasonably good prediction capability to correctly classify the crash types. The study was able to help identify the extent to which those factors impacted on crash outcomes, by the aid of the use of logistic regression model.
Random Forest
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_model <- randomForest(`Crash Type` ~ Age + data_clean$`Speed Limit`+ data_clean$`Day of week`+ data_clean$Gender + data_clean$`Time of day`,
data = data_clean,
importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = `Crash Type` ~ Age + data_clean$`Speed Limit` + data_clean$`Day of week` + data_clean$Gender + data_clean$`Time of day`, data = data_clean, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 38.69%
## Confusion matrix:
## Multiple Single class.error
## Multiple 1582 1404 0.4701942
## Single 1221 2577 0.3214850
rf_pred <- predict(rf_model, data_clean)
rf_confusion_matrix <- table(Predicted = rf_pred, Actual = data_clean$`Crash Type`)
rf_accuracy <- sum(diag(rf_confusion_matrix)) / sum(rf_confusion_matrix)
print(paste("Random Forest Accuracy: ", round(rf_accuracy, 4)))
## [1] "Random Forest Accuracy: 0.6557"
Random Forest was also utilized for classification to come up with better results because it generates numerous decision trees from which it gives high predictive accuracy. This proved that even though logistic regression has been a good model, random forest proved to be superior through ensemble learning. Risk evaluation for assessing high-risk parameters for road accident is essential for carrying out necessary measures for prevention or controlling the rate of occurrence.
The analysis revealed significant crash patterns, including the influence of road user categories, crash types, and contributing factors like speed and age. Studying these characteristics is very essential for policy formulation and enhancing the safety of roads for use. Nevertheless, one still gets the impression of a number of deficits that may have implications of data quality and, therefore, new opportunities for studying the dataset could be identified.
Abd Aziz, F. S., Abdullah, K. H., Harith, S. H., & Sofyan, D. (2022). Trends and evolution of road user behaviour research: A bibliometric review. International Journal of Information Science and Management (IJISM), 20(3), 69-93. Retrieved from: https://ijism.isc.ac/article_698416_7fc0b4dc64019ae20a0b3cfaa5f07e98.pdf [Retrieved on: 05.02.25]
Huang, R., Liu, H., Ma, H., Qiang, Y., Pan, K., Gou, X., … & Glowacz, A. (2022). Accident prevention analysis: Exploring the intellectual structure of a research field. Sustainability, 14(14), 8784. Retrieved from: https://www.mdpi.com/2071-1050/14/14/8784/pdf [Retrieved on: 05.02.25]
Nassiri, H., Mohammadpour, S. I., & Dahaghin, M. (2023). Forecasting time trend of road traffic crashes in Iran using the macro-scale traffic flow characteristics. Heliyon, 9(3). Retrieved from: https://scholar.google.com/scholar?output=instlink&q=info:ikUDDOBRllUJ:scholar.google.com/&hl=en&as_sdt=0,5&as_ylo=2021&scillfp=13072865154789522004&oi=lle [Retrieved on: 02.05.25]
Rahman, M. M., Islam, M. K., Al-Shayeb, A., & Arifuzzaman, M. (2022). Towards sustainable road safety in Saudi Arabia: Exploring traffic accident causes associated with driving behavior using a Bayesian belief network. Sustainability, 14(10), 6315. Retrieved from: https://www.mdpi.com/2071-1050/14/10/6315/pdf [Retrieved on: 05.02.25]
Sangare, M., Gupta, S., Bouzefrane, S., Banerjee, S., & Muhlethaler, P. (2021). Exploring the forecasting approach for road accidents: Analytical measures with hybrid machine learning. Expert Systems with Applications, 167, 113855. Retrieved from: https://www.sciencedirect.com/science/article/am/pii/S0957417420306667 [Retrieved on: 05.02.25]
Tao, W., Aghaabbasi, M., Ali, M., Almaliki, A. H., Zainol, R., Almaliki, A. A., & Hussein, E. E. (2022). An advanced machine learning approach to predicting pedestrian fatality caused by road crashes: A step toward sustainable pedestrian safety. Sustainability, 14(4), 2436. Retrieved from: https://www.mdpi.com/2071-1050/14/4/2436/pdf [Retrieved on: 02.05.25]