1. Introduction

The dataset used in this analysis, titled Australian Fatal Road Accidents (1989–2021), contains comprehensive data on road crashes across Australia. In this case, the main goal is to determine patterns of crashes, the part played by various user classes and prime factors of accidents within the given time span. To achieve these goals our analysis is limited to certain variables which includes crash type, the road user involved, their ages, as well as the period in the day they incurred the accident. The remedy will involve data cleaning, data exploration, hypothesis checking via inferential statistics, predictive modelling, logistic regression and random forest to make the inferences on the data.

2. Data Preprocessing and Cleaning

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(shiny)
library(ggplot2)
library(dplyr)

data <- read_excel("D:/Crash_Data.xlsx")
head(data)
## # A tibble: 6 × 23
##   `Crash ID` State Month  Year Dayweek  Time                `Crash Type`
##        <dbl> <chr> <dbl> <dbl> <chr>    <dttm>              <chr>       
## 1   20212133 Vic       9  2021 Sunday   1899-12-31 00:30:00 Single      
## 2   20214022 SA        9  2021 Saturday 1899-12-31 23:31:00 Multiple    
## 3   20212096 Vic       9  2021 Saturday 1899-12-31 23:00:00 Single      
## 4   20212145 Vic       9  2021 Saturday 1899-12-31 22:25:00 Single      
## 5   20212075 Vic       9  2021 Saturday 1899-12-31 05:15:00 Single      
## 6   20213034 Qld       9  2021 Saturday 1899-12-31 04:00:00 Multiple    
## # ℹ 16 more variables: `Bus Involvement` <chr>,
## #   `Heavy Rigid Truck Involvement` <chr>,
## #   `Articulated Truck Involvement` <chr>, `Speed Limit` <dbl>,
## #   `Road User` <chr>, Gender <chr>, Age <dbl>,
## #   `National Remoteness Areas` <chr>, `SA4 Name 2016` <chr>,
## #   `National LGA Name 2017` <chr>, `National Road Type` <chr>,
## #   `Christmas Period` <chr>, `Easter Period` <chr>, `Age Group` <chr>, …
str(data)
## tibble [52,843 × 23] (S3: tbl_df/tbl/data.frame)
##  $ Crash ID                     : num [1:52843] 20212133 20214022 20212096 20212145 20212075 ...
##  $ State                        : chr [1:52843] "Vic" "SA" "Vic" "Vic" ...
##  $ Month                        : num [1:52843] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Year                         : num [1:52843] 2021 2021 2021 2021 2021 ...
##  $ Dayweek                      : chr [1:52843] "Sunday" "Saturday" "Saturday" "Saturday" ...
##  $ Time                         : POSIXct[1:52843], format: "1899-12-31 00:30:00" "1899-12-31 23:31:00" ...
##  $ Crash Type                   : chr [1:52843] "Single" "Multiple" "Single" "Single" ...
##  $ Bus Involvement              : chr [1:52843] NA "No" NA NA ...
##  $ Heavy Rigid Truck Involvement: chr [1:52843] NA "No" NA NA ...
##  $ Articulated Truck Involvement: chr [1:52843] NA "No" NA NA ...
##  $ Speed Limit                  : num [1:52843] NA 110 NA NA NA 100 100 NA NA 60 ...
##  $ Road User                    : chr [1:52843] "Motorcycle rider" "Pedestrian" "Passenger" "Driver" ...
##  $ Gender                       : chr [1:52843] "Male" "Female" "Male" "Male" ...
##  $ Age                          : num [1:52843] 38 28 19 23 46 19 20 17 2 47 ...
##  $ National Remoteness Areas    : chr [1:52843] "Inner Regional Australia" "Major Cities of Australia" "Inner Regional Australia" "Outer Regional Australia" ...
##  $ SA4 Name 2016                : chr [1:52843] "Melbourne - Outer East" "Adelaide - North" "Hume" "Hume" ...
##  $ National LGA Name 2017       : chr [1:52843] "Yarra Ranges (S)" "Playford (C)" "Wangaratta (RC)" "Wangaratta (RC)" ...
##  $ National Road Type           : chr [1:52843] "Arterial Road" NA "Access road" "Arterial Road" ...
##  $ Christmas Period             : chr [1:52843] "No" "No" "No" "No" ...
##  $ Easter Period                : chr [1:52843] "No" "No" "No" "No" ...
##  $ Age Group                    : chr [1:52843] "26_to_39" "26_to_39" "17_to_25" "17_to_25" ...
##  $ Day of week                  : chr [1:52843] "Weekend" "Weekend" "Weekend" "Weekend" ...
##  $ Time of day                  : chr [1:52843] "Night" "Night" "Night" "Night" ...
missing_data <- colSums(is.na(data))
print(missing_data)
##                      Crash ID                         State 
##                             0                             0 
##                         Month                          Year 
##                             0                             0 
##                       Dayweek                          Time 
##                             0                            40 
##                    Crash Type               Bus Involvement 
##                             0                            22 
## Heavy Rigid Truck Involvement Articulated Truck Involvement 
##                         20515                            22 
##                   Speed Limit                     Road User 
##                           709                             0 
##                        Gender                           Age 
##                            27                             0 
##     National Remoteness Areas                 SA4 Name 2016 
##                         45965                         45951 
##        National LGA Name 2017            National Road Type 
##                         45950                         45966 
##              Christmas Period                 Easter Period 
##                             0                             0 
##                     Age Group                   Day of week 
##                            90                             0 
##                   Time of day 
##                             0
data_clean <- na.omit(data)
print("Summary statistics:")
## [1] "Summary statistics:"
summary(data)
##     Crash ID           State               Month             Year     
##  Min.   :19891001   Length:52843       Min.   : 1.000   Min.   :1989  
##  1st Qu.:19951106   Class :character   1st Qu.: 4.000   1st Qu.:1995  
##  Median :20021441   Mode  :character   Median : 7.000   Median :2002  
##  Mean   :20030206                      Mean   : 6.569   Mean   :2003  
##  3rd Qu.:20104082                      3rd Qu.:10.000   3rd Qu.:2010  
##  Max.   :20218009                      Max.   :12.000   Max.   :2021  
##                                                                       
##    Dayweek               Time                         Crash Type       
##  Length:52843       Min.   :1899-12-31 00:00:00.00   Length:52843      
##  Class :character   1st Qu.:1899-12-31 08:15:00.00   Class :character  
##  Mode  :character   Median :1899-12-31 14:00:00.00   Mode  :character  
##                     Mean   :1899-12-31 13:03:29.09                     
##                     3rd Qu.:1899-12-31 18:00:00.00                     
##                     Max.   :1899-12-31 23:59:00.00                     
##                     NA's   :40                                         
##  Bus Involvement    Heavy Rigid Truck Involvement Articulated Truck Involvement
##  Length:52843       Length:52843                  Length:52843                 
##  Class :character   Class :character              Class :character             
##  Mode  :character   Mode  :character              Mode  :character             
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##   Speed Limit      Road User            Gender               Age        
##  Min.   : -9.00   Length:52843       Length:52843       Min.   : -9.00  
##  1st Qu.: 60.00   Class :character   Class :character   1st Qu.: 22.00  
##  Median : 80.00   Mode  :character   Mode  :character   Median : 34.00  
##  Mean   : 81.86                                         Mean   : 39.66  
##  3rd Qu.:100.00                                         3rd Qu.: 55.00  
##  Max.   :130.00                                         Max.   :101.00  
##  NA's   :709                                                            
##  National Remoteness Areas SA4 Name 2016      National LGA Name 2017
##  Length:52843              Length:52843       Length:52843          
##  Class :character          Class :character   Class :character      
##  Mode  :character          Mode  :character   Mode  :character      
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  National Road Type Christmas Period   Easter Period       Age Group        
##  Length:52843       Length:52843       Length:52843       Length:52843      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Day of week        Time of day       
##  Length:52843       Length:52843      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

The first process of an analysis is data importing, which can be executed using the read_excel() function, this helps to review several rows in a dataset in order to determine its structure (Abd Aziz et al., 2022). For data quality check, we first scanned for the missing values in the data set by using the colSums(is.na(data)) and dealt with them by eliminating the rows with the missing values using the na.omit() function. This move was an essential step to avoid cases of mistakes when performing the analysis.

Also, all the names of columns were modified to reduce confusion by making them all appear in the same format. Some of the column names including Crash Type, Bus Involvement, and Road User were transformed to easily understandable formats known as “Crash_Type” and “Road_User”, respectively, so that it was uniform through the rows of the database. This standardization is important in order to obtain reliable calculations (Sangare et al., 2021).

Next we did a brief description of the dataset using the functions str() and summary(), which give an idea on the type of data stored in a variable, as well as the mean, median, quartiles among others, which give an idea of the distribution of a given data.

3. Exploratory Data Analysis (EDA)

colnames(data)[colnames(data) == "Crash Type"] <- "Crash_Type"
colnames(data)[colnames(data) == "Bus Involvement"] <- "Bus_Involvement"
colnames(data)[colnames(data) == "Heavy Rigid Truck Involvement"] <- "Heavy_Rigid_Truck_Involvement"
colnames(data)[colnames(data) == "Articulated Truck Involvement"] <- "Articulated_Truck_Involvement"
colnames(data)[colnames(data) == "Road User"] <- "Road_User"
colnames(data)[colnames(data) == "Age Group"] <- "Age_Group"
colnames(data)[colnames(data) == "Day of week"] <- "Day_of_week"
colnames(data)[colnames(data) == "Time of day"] <- "Time_of_day"
unique(data$`Crash_Type`)
## [1] "Single"   "Multiple"
data %>%
  group_by(`Crash_Type`) %>%
  summarise(count = n())
## # A tibble: 2 × 2
##   Crash_Type count
##   <chr>      <int>
## 1 Multiple   23594
## 2 Single     29249

The exploratory analysis started with the creation of new crash type variables and road user variables, which were accomplished through the use of unique() function, and grouped by. A bar chart presented crash distributions categorizing vehicles, drivers, pedestrians, and cyclists, the drivers are more likely to be involved in single-car crashes while pedestrians and cyclists are more exposed to multi-vehicle crashes (Huang et al., 2022).

The line chart depicting the number of crashes gives an indication of the yearly pattern of the same from 1989 to 2021. There is a sudden increase in the crashes in the mid-2000s and a relatively decrease during the later years. This trend indicates that, despite the possibility that road safety measures have enhanced, there was a time that had high numbers of fatalities during the early times of the data set. A heatmap was taken for commonality between crash types and road users. It established that certain crash types, for instance, single vehicle, a specific type of crash commonly involving drivers while other types such as pedestrian crash are more associated with pedestrians.

Bar Chart

crash_type_counts <- data %>%
  group_by(Crash_Type, Road_User) %>%
  summarise(count = n(), .groups = 'drop')

ggplot(crash_type_counts, aes(x = reorder(Crash_Type, -count), y = count, fill = Road_User)) +
  geom_bar(stat = "identity") +
  labs(title = "Crash Type Count by Road User", x = "Crash Type", y = "Crash Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme_minimal() + 
  coord_flip()  

Line Chart

data$Year <- as.numeric(as.character(data$Year))

crash_rate_per_year <- data %>%
  group_by(Year) %>%
  summarise(crash_count = n(), .groups = 'drop')

ggplot(crash_rate_per_year, aes(x = Year, y = crash_count, group = 1)) +
  geom_line(color = "blue") +  # Adding a line in blue
  labs(title = "Crash Count by Year", x = "Year", y = "Crash Count") +
  theme_minimal()

Heat Map

crash_rate_data <- data %>%
  group_by(Crash_Type, Road_User) %>%
  summarise(crash_count = n(), .groups = 'drop')

ggplot(crash_rate_data, aes(x = Crash_Type, y = Road_User, fill = crash_count)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "red") +
  labs(title = "Heatmap of Crash Rates by Type and User", x = "Crash Type", y = "Road User") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme_minimal()

Pie Chart

crash_counts_by_user <- data %>%
  group_by(Road_User) %>%
  summarise(total_crashes = n(), .groups = 'drop')

ggplot(crash_counts_by_user, aes(x = "", y = total_crashes, fill = Road_User)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Proportion of Crashes by Road User") + 
  theme_void()

Last of all, a pie chart was developed to show the distribution of the crashes according to the road user. Crashes are dominated most by drivers but a reasonable quantity of them is caused by pedestrian, cyclists, and motorcyclists, hence there is need to enhance the safety for all the road users.

Box Plot

crash_count_by_type <- data %>%
  group_by(Crash_Type) %>%
  summarise(crash_count = n(), .groups = 'drop')

ggplot(crash_count_by_type, aes(x = Crash_Type, y = crash_count, fill = Crash_Type)) +
  geom_boxplot() +
  labs(title = "Distribution of Crash Count by Crash Type", x = "Crash Type", y = "Crash Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

A box plot was also created in order to present the distribution of the crash counts as per crash types. This map showed that some crash types, such as involving heavy vehicles, have higher variance in the crash occurrence, and very high numbers among them which may refer to more serious crashes.

4. Hypothesis Testing and Statistical Analysis

T Test

colnames(data_clean)[colnames(data_clean) == "Road User"] <- "Road_User"
pedestrian_vs_driver <- data_clean %>%
  filter(Road_User %in% c("Pedestrian", "Driver"))

t_test_result <- t.test(`Crash ID` ~ Road_User, data = pedestrian_vs_driver)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  Crash ID by Road_User
## t = 1.1476, df = 1517.3, p-value = 0.2513
## alternative hypothesis: true difference in means between group Driver and group Pedestrian is not equal to 0
## 95 percent confidence interval:
##  -600.4636 2293.8229
## sample estimates:
##     mean in group Driver mean in group Pedestrian 
##                 20181247                 20180400

A t-test was conducted to compare crash counts between pedestrians and drivers. The test was meant to find out whether there were any significant differences in the number of accidents that involved these two road users (Nassiri, Mohammadpour & Dahaghin, 2023). They showed that there was a difference, calling for a statistic value to define whether two means, in this case, pedestrians and drivers, are equal where with a p-value, again showing that pedestrians are at a higher risk of getting involved in road accidents.

Anova

anova_result <- aov(`Crash ID` ~ Road_User, data = data_clean)
summary(anova_result)
##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## Road_User      6 6.191e+09 1.032e+09   2.569 0.0174 *
## Residuals   6777 2.722e+12 4.017e+08                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An ANOVA test was performed to evaluate differences in crash counts across multiple road user categories. Further, concerning the keyed and unkeyed crashes, and changing the aspect of ‘state cause code,’ the crosstab approach confirmed there were differences in the number of crashes among cyclists, pedestrians, and drivers, whereby significance was determined by the p-value at least one group was significantly different to other groups. This indicates that crash occurrence is not unique and depends on the type of roadway user (Tao et al., 2022).

Chi Square Test

contingency_table <- table(data_clean$Road_User, data_clean$`Crash Type`)
chi_square_result <- chisq.test(contingency_table)
print(chi_square_result)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 544.09, df = 6, p-value < 2.2e-16

Finally, Chi-Square Test was conducted to test the relationship between the road users and the crash types. The results of the Chi-square test show that these distributions were not independent and the p-value pointed out the fact that certain specific crash types are more related to specific road users.

5. Predictive Modeling

library(tidyverse)

data_clean$`Crash Type` <- factor(data_clean$`Crash Type`)
data_clean$`Day of week` <- factor(data_clean$`Day of week`)

logit_model <- glm(`Crash Type` ~ `Speed Limit` + Age + `Day of week`, 
                   family = binomial(link = "logit"), 
                   data = data_clean)

summary(logit_model)
## 
## Call:
## glm(formula = `Crash Type` ~ `Speed Limit` + Age + `Day of week`, 
##     family = binomial(link = "logit"), data = data_clean)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.631174   0.116316   5.426 5.75e-08 ***
## `Speed Limit`        -0.002719   0.001126  -2.415   0.0158 *  
## Age                  -0.006394   0.001131  -5.655 1.56e-08 ***
## `Day of week`Weekend  0.316247   0.051229   6.173 6.69e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9307.2  on 6783  degrees of freedom
## Residual deviance: 9221.6  on 6780  degrees of freedom
## AIC: 9229.6
## 
## Number of Fisher Scoring iterations: 4
predicted_probs <- predict(logit_model, type = "response")  
predicted_class <- ifelse(predicted_probs > 0.5, "Multiple", "Single")  
confusion_matrix <- table(Predicted = predicted_class, Actual = data_clean$`Crash Type`)

print(confusion_matrix)
##           Actual
## Predicted  Multiple Single
##   Multiple     2389   3242
##   Single        597    556
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy: ", round(accuracy, 4)))
## [1] "Accuracy:  0.4341"

In the analysis, Logistic Regression Model was used for predicting crash types depending on factors like speed limit, age as well as day of the week, day of the year (Rahman et al., 2022). Finally, the result indicated in Table 7 depicted the confusion matrix that shows that the model has reasonably good prediction capability to correctly classify the crash types. The study was able to help identify the extent to which those factors impacted on crash outcomes, by the aid of the use of logistic regression model.

Random Forest

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model <- randomForest(`Crash Type` ~ Age + data_clean$`Speed Limit`+ data_clean$`Day of week`+ data_clean$Gender + data_clean$`Time of day`,
                         data = data_clean, 
                         importance = TRUE)

print(rf_model)
## 
## Call:
##  randomForest(formula = `Crash Type` ~ Age + data_clean$`Speed Limit` +      data_clean$`Day of week` + data_clean$Gender + data_clean$`Time of day`,      data = data_clean, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 38.69%
## Confusion matrix:
##          Multiple Single class.error
## Multiple     1582   1404   0.4701942
## Single       1221   2577   0.3214850
rf_pred <- predict(rf_model, data_clean)
rf_confusion_matrix <- table(Predicted = rf_pred, Actual = data_clean$`Crash Type`)
rf_accuracy <- sum(diag(rf_confusion_matrix)) / sum(rf_confusion_matrix)

print(paste("Random Forest Accuracy: ", round(rf_accuracy, 4)))
## [1] "Random Forest Accuracy:  0.6557"

Random Forest was also utilized for classification to come up with better results because it generates numerous decision trees from which it gives high predictive accuracy. This proved that even though logistic regression has been a good model, random forest proved to be superior through ensemble learning. Risk evaluation for assessing high-risk parameters for road accident is essential for carrying out necessary measures for prevention or controlling the rate of occurrence.

6. Conclusion

The analysis revealed significant crash patterns, including the influence of road user categories, crash types, and contributing factors like speed and age. Studying these characteristics is very essential for policy formulation and enhancing the safety of roads for use. Nevertheless, one still gets the impression of a number of deficits that may have implications of data quality and, therefore, new opportunities for studying the dataset could be identified.

7. References

Abd Aziz, F. S., Abdullah, K. H., Harith, S. H., & Sofyan, D. (2022). Trends and evolution of road user behaviour research: A bibliometric review. International Journal of Information Science and Management (IJISM), 20(3), 69-93. Retrieved from: https://ijism.isc.ac/article_698416_7fc0b4dc64019ae20a0b3cfaa5f07e98.pdf [Retrieved on: 05.02.25]

Huang, R., Liu, H., Ma, H., Qiang, Y., Pan, K., Gou, X., … & Glowacz, A. (2022). Accident prevention analysis: Exploring the intellectual structure of a research field. Sustainability, 14(14), 8784. Retrieved from: https://www.mdpi.com/2071-1050/14/14/8784/pdf [Retrieved on: 05.02.25]

Nassiri, H., Mohammadpour, S. I., & Dahaghin, M. (2023). Forecasting time trend of road traffic crashes in Iran using the macro-scale traffic flow characteristics. Heliyon, 9(3). Retrieved from: https://scholar.google.com/scholar?output=instlink&q=info:ikUDDOBRllUJ:scholar.google.com/&hl=en&as_sdt=0,5&as_ylo=2021&scillfp=13072865154789522004&oi=lle [Retrieved on: 02.05.25]

Rahman, M. M., Islam, M. K., Al-Shayeb, A., & Arifuzzaman, M. (2022). Towards sustainable road safety in Saudi Arabia: Exploring traffic accident causes associated with driving behavior using a Bayesian belief network. Sustainability, 14(10), 6315. Retrieved from: https://www.mdpi.com/2071-1050/14/10/6315/pdf [Retrieved on: 05.02.25]

Sangare, M., Gupta, S., Bouzefrane, S., Banerjee, S., & Muhlethaler, P. (2021). Exploring the forecasting approach for road accidents: Analytical measures with hybrid machine learning. Expert Systems with Applications, 167, 113855. Retrieved from: https://www.sciencedirect.com/science/article/am/pii/S0957417420306667 [Retrieved on: 05.02.25]

Tao, W., Aghaabbasi, M., Ali, M., Almaliki, A. H., Zainol, R., Almaliki, A. A., & Hussein, E. E. (2022). An advanced machine learning approach to predicting pedestrian fatality caused by road crashes: A step toward sustainable pedestrian safety. Sustainability, 14(4), 2436. Retrieved from: https://www.mdpi.com/2071-1050/14/4/2436/pdf [Retrieved on: 02.05.25]