Introduction

Private vehicle transportation is an ever-growing market, as technological advancements in supply chains and fuel technology allow manufacturers to deploy cheaper and more robust vehicles, putting some form of transportation within reach of the average individual. An increase in vehicle and licence registrations raises the question of how many fatalities occur on Australian roads each year (“Registration statistics (Department of Transport and Main Roads)”, 2020). This research paper aims to analyse whether the number of fatalities can be predicted from historical observations using appropriate statistical methods. The results of this analysis can be used by road safety authorities as supplementary information when planning safety initiatives. For the purpose of this paper, we focus on Queensland road fatalities only.

Background

In Assessment 2, crash statistics from Queensland’s Department of Transport and Main Roads were used to examine how motorcycle riders are affected, guided by the following questions:

  1. What effect does the nature of a crash and road features have on fatalities in lower speed zones?
  2. What is the overall effect of rainfall on motorcyclist fatalities?
  3. How does rainfall affect the chance of a fatal crash on ‘featureless’ roads, as well as in certain speed zones?

Several interesting insights were drawn from our analysis. Using a logistic regression model, we found that hitting an object, head-on collisions and featureless roads had the most significant impact on fatalities for motorcycle riders, whereas accidents around roundabouts and intersections had a lower chance of being fatal. Additionally, when rainfall was added as a variable to ‘featureless’ road crashes, the results showed a 27.25% chance of a fatality in a month with above-average rainfall and an 11.48% chance in a month with below-average rainfall.

However, because the number of fatalities was significantly lower than the number of accidents, the resulting imbalance in the dataset, along with a lack of granularity among the variables, restricted our ability to arrive at a definitive conclusion.

Objective

The objective of this research is to focus on all road users in Queensland and predict the number of fatalities for future years using a Time Series analysis. When looking at a simple moving average graph for all road users in Queensland, a downward trend can be seen after 2010 (Figure 1). The dataset includes crash statistics from 2001 to 2018. The results from this research will allow authorities to gain a better understanding of trends and how to target areas of need accordingly.
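The simple moving average mentioned above can be illustrated with a minimal Python sketch. It is not the code used to produce Figure 1; the function name, the window length and the monthly counts are all hypothetical and chosen only for illustration.

```python
from typing import List


def simple_moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running_total = sum(values[:window])
    averages = [running_total / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running_total += values[i] - values[i - window]
        averages.append(running_total / window)
    return averages


# Hypothetical monthly fatality counts, not taken from the dataset.
monthly_counts = [30, 24, 27, 21, 33, 27]
print(simple_moving_average(monthly_counts, window=3))
```

A longer window smooths out more of the month-to-month variation, which makes a slow trend easier to see at the cost of fewer plotted points.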

Purpose of this research

Exploratory Data Analysis

An initial look at the time series data shows a downward trend after 2010, along with shifts due to seasonality (Figure 2). Examining the seasonality further through the subseries plot in Figure 3, average fatalities rise from February through to July and peak in August. After a decline from September to October, fatalities increase again over the traditional holiday period from December to January.

An additive decomposition is plotted in Figure 4 to understand the various components of the data. It further confirms the presence of a seasonal component through a repeating short-term cycle. The trend component shows fatalities declining well before 2010. Furthermore, a slight increase can be seen from 2011 onward, trailing off to the lowest levels after 2014. This seems to be in line with the changes highlighted in the Road Safety Action Plan of 2008 by the Queensland government, which outlines the implementation of several safety initiatives such as improving road quality in high-risk areas, the installation of median crash barriers and the Motorcycle Safety Mass Action program (Queensland Transport Department of Main Roads, 2008).
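To make the idea of an additive decomposition concrete, the sketch below splits a series into trend, seasonal and remainder parts using a classical approach (centred moving average for the trend, per-period averages for the seasonal part). This is an assumption about the general technique, not the code behind Figure 4; the function names and the toy series are hypothetical, and the sketch assumes an even seasonal period such as 12.

```python
from typing import List, Optional, Tuple


def centred_moving_average(series: List[float], period: int) -> List[Optional[float]]:
    """2 x period centred moving average (even period); None where undefined."""
    half = period // 2
    trend: List[Optional[float]] = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]  # period + 1 values
        # End points get half weight so the window is centred on t.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend


def additive_decompose(series: List[float], period: int) -> Tuple[list, list, list]:
    """Return (trend, seasonal, remainder) with series = trend + seasonal + remainder."""
    if period % 2 != 0 or len(series) < 2 * period:
        raise ValueError("sketch assumes an even period and at least two full cycles")
    trend = centred_moving_average(series, period)
    detrended = [y - tr if tr is not None else None for y, tr in zip(series, trend)]
    # Average the detrended values for each position in the cycle.
    means = []
    for k in range(period):
        vals = [d for i, d in enumerate(detrended) if d is not None and i % period == k]
        means.append(sum(vals) / len(vals))
    offset = sum(means) / period  # force the seasonal effects to sum to zero
    pattern = [m - offset for m in means]
    seasonal = [pattern[i % period] for i in range(len(series))]
    remainder = [
        y - tr - s if tr is not None else None
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, remainder


# Toy series: linear trend plus a repeating period-4 pattern (illustrative only).
toy_pattern = [1.0, -1.0, 2.0, -2.0]
toy_series = [t + toy_pattern[t % 4] for t in range(16)]
toy_trend, toy_seasonal, toy_remainder = additive_decompose(toy_series, period=4)
print(toy_seasonal[:4])
```

For monthly data the period would be 12, so the first and last six months have no trend estimate.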

We use the autocorrelation function (ACF) to understand the relationship between our data points and their lagged values. The large spike at lag 12 in Figure 5 points to seasonality, as the data consists of monthly observations. Furthermore, the partial autocorrelation function (PACF) plot in Figure 6 shows the correlation between the series and each lag after the effects of the shorter lags have been removed, rather than the total correlation reported by the ACF. This can be useful for highlighting less obvious structure, as these lags can be used as features when building our model in the next steps.

Furthermore, Figure 5 shows that none of the lags lie within the expected 95% limits, further highlighting the non-stationary nature of the data. We perform the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to confirm this.
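As a rough illustration of what the ACF plot measures, the following Python sketch computes sample autocorrelations and the approximate 95% bound of 1.96 divided by the square root of the series length. It is not the code behind Figure 5; the function name and the repeating 12-month pattern are hypothetical.

```python
import math
from typing import List


def sample_acf(series: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelations for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    acf = []
    for lag in range(1, max_lag + 1):
        numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
        acf.append(numerator / denominator)
    return acf


# Hypothetical monthly pattern repeated for six years (illustrative only).
yearly_pattern = [20, 22, 24, 26, 28, 30, 32, 34, 27, 25, 29, 31]
monthly_series = [float(v) for v in yearly_pattern * 6]

acf_values = sample_acf(monthly_series, max_lag=12)
bound = 1.96 / math.sqrt(len(monthly_series))  # approximate 95% limit
print("lag 12 autocorrelation:", acf_values[11])
print("significant at lag 12:", abs(acf_values[11]) > bound)
```

A value well outside the bound at lag 12 is the kind of spike that suggests a yearly seasonal pattern in monthly data.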

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 2.1694 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic of 2.1694 is well above the 1% critical value of 0.739. Therefore, the null hypothesis of stationarity is rejected, and the data is treated as non-stationary.
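The decision rule used here can be sketched in Python. The function below computes a level-stationarity KPSS statistic with a Bartlett-weighted long-run variance and compares it with the critical values quoted in the output above. It is a simplified illustration under those assumptions, not the implementation that produced the output, and the function names and toy series are hypothetical.

```python
from typing import List

# Critical values for the level ("mu") KPSS test, copied from the output above.
KPSS_MU_CRITICAL = {"10pct": 0.347, "5pct": 0.463, "2.5pct": 0.574, "1pct": 0.739}


def kpss_level_statistic(series: List[float], lags: int) -> float:
    """KPSS statistic for the null hypothesis of stationarity around a constant level."""
    n = len(series)
    mean = sum(series) / n
    residuals = [x - mean for x in series]

    # Sum of squared partial sums of the residuals.
    partial_sum = 0.0
    sum_sq_partial = 0.0
    for e in residuals:
        partial_sum += e
        sum_sq_partial += partial_sum ** 2

    # Long-run variance estimate with Bartlett weights.
    long_run_var = sum(e * e for e in residuals) / n
    for s in range(1, lags + 1):
        weight = 1.0 - s / (lags + 1)
        autocov = sum(residuals[t] * residuals[t - s] for t in range(s, n)) / n
        long_run_var += 2.0 * weight * autocov

    return sum_sq_partial / (n * n * long_run_var)


def rejects_stationarity(statistic: float, level: str = "5pct") -> bool:
    """Reject the null of stationarity when the statistic exceeds the critical value."""
    return statistic > KPSS_MU_CRITICAL[level]


# Statistic reported in the output above.
print(rejects_stationarity(2.1694, "1pct"))

# Toy trending series (illustrative only): a steady trend is clearly non-stationary.
trending = [float(t) for t in range(100)]
print(rejects_stationarity(kpss_level_statistic(trending, lags=4), "1pct"))
```

Note that the KPSS null hypothesis is stationarity, so a large statistic is evidence against it.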

Modeling

To meet the preconditions of the ARIMA function, the data needs to be stationary, meaning that statistical properties such as the mean, variance and autocorrelation are constant over time (Palachy, 2020). We use the ‘ndiffs()’ function to determine the number of differences required to make the data stationary (Hyndman & Athanasopoulos, 2018).

## [1] 1

As per the results above, one difference is required to make the data stationary. Since our initial observations showed a seasonal correlation, we use the ‘nsdiffs()’ function to see if a further seasonal difference step is required (Hyndman & Athanasopoulos, 2018).

## [1] 0
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.015 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The results show that we do not require any further transformations of the data. After performing another KPSS unit root test on the differenced series, the test statistic of 0.015 is lower than all of the listed critical values, so we fail to reject the null hypothesis and treat the data as stationary.

We utilise the ACF and PACF plots again to determine the appropriate inputs for the ARIMA model. Figure 7 shows no significant lags, as they are all within the critical limits. Figure 8 shows a geometric decay in the PACF plot, pointing to an MA process of the form MA(2).
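The differencing step used to reach stationarity can be illustrated with a short Python sketch. The function name and the sample values are hypothetical; a lag of 1 gives the first difference, and a lag of 12 would give a seasonal difference for monthly data.

```python
from typing import List


def difference(series: List[float], lag: int = 1) -> List[float]:
    """Return series[t] - series[t - lag] for every t where both values exist."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical counts with an upward drift (illustrative only).
counts = [10, 12, 15, 19]
print(difference(counts))         # first difference
print(difference(counts, lag=2))  # difference at lag 2
```

Each difference shortens the series by the lag used, which is why only as many differences as needed are applied.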

ARIMA Model

Hyndman & Athanasopoulos (2018) outline the Autoregressive Integrated Moving Average (ARIMA) model as a generalised prediction model which works well with stationary data. The model uses three mandatory parameters to predict the value of y. A non-seasonal model can be summarised as ARIMA(p,d,q), where p is the order of the autoregressive part, d is the degree of first differencing involved, and q is the order of the moving average part.

Though the orders can be picked manually using the ACF and PACF plots, we use the ‘auto.arima()’ function to determine the most appropriate model for our data.
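For reference, the general non-seasonal form can be written out as follows. This follows the standard textbook notation and is added here as an illustration: y'_t denotes the series after differencing d times, c is a constant, the phi terms are the p autoregressive coefficients, the theta terms are the q moving average coefficients, and epsilon_t is white noise.

```latex
\[
y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p}
       + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}
       + \varepsilon_t
\]
```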

## Series: total_casualties_dff 
## ARIMA(0,0,1)(1,0,0)[12] with zero mean 
## 
## Coefficients:
##           ma1    sar1
##       -0.9152  0.1670
## s.e.   0.0335  0.0704
## 
## sigma^2 estimated as 30.24:  log likelihood=-671.56
## AIC=1349.11   AICc=1349.23   BIC=1359.22
## 
##  Box-Ljung test
## 
## data:  a$residuals
## X-squared = 6.4506, df = 12, p-value = 0.8917

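The Auto ARIMA function suggests ARIMA(0,0,1)(1,0,0)[12] as the most appropriate model. The residual diagnostics indicate that the model could produce statistically sound forecasts: the Box-Ljung test on the residuals returns a p-value of 0.8917, well above 0.05, so there is no evidence of autocorrelation remaining in the residuals.

The data is split into training and test sets with a roughly 80/20 split. On the test set, the model has an RMSE of 5.229442 and a MAPE of 25.43131, compared with an RMSE of 5.538474 and a MAPE of 18.84959 on the training set.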
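Written out with the coefficients reported above, the selected model has one non-seasonal moving average term and one seasonal autoregressive term at lag 12. The equation below assumes the usual convention in which the moving average coefficient enters with a plus sign, and it uses y'_t for the differenced series that was passed to the model (zero mean, so no constant).

```latex
\[
y'_t = \Phi_1\, y'_{t-12} + \varepsilon_t + \theta_1\, \varepsilon_{t-1},
\qquad \Phi_1 = 0.1670,\quad \theta_1 = -0.9152
\]
\[
y'_t = 0.1670\, y'_{t-12} + \varepsilon_t - 0.9152\, \varepsilon_{t-1}
\]
```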

##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.1066244 5.538474 4.500426  -5.205723 18.84959 0.7742668
## Test set     -1.9447511 5.229442 4.167708 -17.180196 25.43131 0.7170250
##                    ACF1 Theil's U
## Training set 0.02518764        NA
## Test set     0.04375397  0.681873
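The split and the main error measures in the table above can be sketched in Python as follows. This is an illustrative reimplementation of the standard formulas, not the code that produced the table; the function names and sample numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def ordered_split(series: Sequence[float], train_fraction: float = 0.8) -> Tuple[list, list]:
    """Split a time series in order: earliest part for training, latest for testing."""
    cut = int(len(series) * train_fraction)
    return list(series[:cut]), list(series[cut:])


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and forecast values (illustrative only).
actual_values = [10.0, 20.0, 40.0]
forecast_values = [12.0, 18.0, 44.0]
print(rmse(actual_values, forecast_values))
print(mae(actual_values, forecast_values))
print(mape(actual_values, forecast_values))
```

The split keeps the observations in time order rather than sampling at random, because the test set has to lie after the training period for the forecast evaluation to be meaningful.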

We compare the ARIMA model with other standard models to check whether it is the best model to use with our data. The ‘Average’ model sets all future values equal to the average of the historical data. The ‘Drift’ method is a variation of the naive method in which the amount of change over time is set to the average change seen in the historical data. As shown below, the ARIMA model has the lowest RMSE, MAE and MASE of the models considered, although the Drift model has a lower MAPE.

##             RMSE      MAE     MAPE      MASE
## ARIMA   5.229442 4.167708 25.43131 0.7170250
## Average 7.460283 6.285256 39.14475 1.0813344
## Drift   5.365353 4.308602 21.94244 0.7412649
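The benchmark methods in this comparison are simple enough to sketch directly. The Python functions below follow the standard definitions of the average, naive and drift forecasts; they are an illustration rather than the code that produced the table, and the function names and history values are hypothetical.

```python
from typing import List


def average_forecast(history: List[float], horizon: int) -> List[float]:
    """Every future value equals the mean of the historical data."""
    mean = sum(history) / len(history)
    return [mean] * horizon


def naive_forecast(history: List[float], horizon: int) -> List[float]:
    """Every future value equals the last observed value."""
    return [history[-1]] * horizon


def drift_forecast(history: List[float], horizon: int) -> List[float]:
    """Naive forecast plus the average historical change per step."""
    if len(history) < 2:
        raise ValueError("drift needs at least two observations")
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + h * slope for h in range(1, horizon + 1)]


# Hypothetical history (illustrative only).
history = [10.0, 12.0, 11.0, 15.0, 16.0]
print(average_forecast(history, 3))
print(naive_forecast(history, 3))
print(drift_forecast(history, 3))
```

The drift forecast is equivalent to drawing a straight line through the first and last observations and extending it forward.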

We continue with the ARIMA model and use the forecast package to predict fatalities for the next four years. The graph below shows the general trend fatalities will follow over the forecast period. Though a slight increase is predicted in the shorter term, fatalities seem to plateau further into the future. The forecast averages also seem to be significantly less spread out than their historical counterparts.

The predicted values for the next four years, shown below, indicate that 2019 will have a higher number of fatalities than the previous year. This trend continues until the end of 2022.

Reflection

The goal of this research paper was to predict the number of fatalities for the next four years using a time series model. We also wanted to see if any seasonal or cyclical patterns exist in the dataset. Both questions were answered, as a seasonal pattern was discovered in the dataset. This could be due to the frequency of the data, as it is recorded for each month of each year. Furthermore, the prediction model allows us to see the expected number of fatalities if no further initiatives are undertaken by the government. As per the latest Road Safety Action Plan published by the Queensland government, the aim is to reduce the number of fatalities to zero (Queensland Transport Department of Main Roads, 2015). Though this is an ambitious plan, it is important for safety authorities to try initiatives that reduce the human cost of deaths on our roads. One reason the predictions are higher than recent historical values could be that the latest safety initiatives were not included as features in our model. However, these results can be considered accurate based on historical observations.

Conclusion

The group assignment analysed the impacts of various factors on motorcycle riders in Queensland between 2001 and 2018. This research paper looked at the overall fatalities on the State’s roads to see if any trends can be identified from historical observations. We utilised statistical methods such as autocorrelation, partial autocorrelation, trend analysis and unit root tests to determine whether the available data was good enough to build a model upon. Upon recognising the non-stationary nature of the data, transformations were undertaken to make the data stationary. A prediction model was generated using the Auto ARIMA function and its residuals were examined for a proper fit. Using the forecast package, we were able to predict total road fatalities for the next four years.

References

Hyndman, R., & Athanasopoulos, G. (2018). Forecasting: principles and practice (2nd ed.). OTexts.

Palachy, S. (2020). Stationarity in time series analysis. Medium. Retrieved 13 November 2020, from https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322.

Queensland Transport Department of Main Roads. (2008). Road Safety Action Plan. Retrieved from https://cabinet.qld.gov.au/documents/2008/Nov/Road%20Safety%20Action%20Plan/Attachments/Road_safety_action_plan_2008_2009_complete.pdf

Queensland Transport Department of Main Roads. (2015). Road Safety Action Plan.

Registration statistics (Department of Transport and Main Roads). Tmr.qld.gov.au. (2020). Retrieved 13 November 2020, from https://www.tmr.qld.gov.au/Safety/Transport-and-road-statistics/Registration-statistics.