Traffic Accidents in Cincinnati

Assignment 3 - FORECASTING METHODS (BANA 7050)

Author

Tauseef Ahmed

Published

February 17, 2023

Section 1

The data selected for this time series analysis is the Traffic Crash Reports dataset, sourced from the Cincinnati Police Department’s website. Records in this dataset are generated whenever a crash is reported to the Cincinnati Police Department (CPD), and the reports cover fatal, injury, and non-injury crashes.

Load and Analyze Data

Code
data %>%
    mutate(ma_12 = rollmean(value, k = 12, na.pad = TRUE)) %>%  # 12-month moving average (zoo::rollmean)
    ggplot() +
    geom_line(aes(date, value)) +                  # monthly crash counts
    geom_line(aes(date, ma_12), color = 'blue') +  # moving-average trend line
    theme_bw()

The plot above shows the number of traffic accidents in Cincinnati from January 2013 through December 2022. The blue line is a 12-month moving average (k = 12), which highlights the underlying trend.

Split Data into Train & Test sets

Before conducting further analysis, the data is split 80-20 into training and testing sets. The cut-off date for the split is December 2020 (indicated by the red line).

Code
data %>%
    ggplot()+
    geom_line(aes(date,value))+
    geom_vline(aes(xintercept=ymd('2020-12-01')),color='red')+
    theme_bw()
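
The creation of the train and test objects is not shown in the report; a minimal sketch of the 80-20 split (assuming dplyr and lubridate are loaded, as elsewhere in this report) could look like this:

Code
# Roughly 80% of the 120 months (Jan 2013 - Dec 2020) for training
train <- data %>% filter(date <= ymd("2020-12-01"))
# Remaining 20% (Jan 2021 - Dec 2022) held out for testing
test <- data %>% filter(date > ymd("2020-12-01"))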

Initial Prophet Model

The Facebook Prophet model is based on three main components: trend, seasonality, and holidays. The model is often used to forecast data with strong seasonality. The trend component can handle both positive and negative changes, and it uses a piecewise growth curve to model non-periodic changes. The seasonality component captures seasonal effects occurring at daily, weekly, and/or yearly levels. Finally, the holidays component accounts for special events that affect the time series.
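
In its usual additive form, the model can be written as the following decomposition (a standard way of stating Prophet’s structure, included here for reference):

y(t) = g(t) + s(t) + h(t) + ε(t)

where g(t) is the piecewise trend, s(t) the periodic seasonal component, h(t) the holiday effect, and ε(t) the error term.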

For the Cincinnati traffic accidents data, we will now fit an initial Prophet model and view a 12-month forecast. Note that we are analyzing this time series at the monthly level; therefore, holidays will not be considered in our models.

Code
prophet_data_full <- data %>% 
    rename(ds = date, # Have to name our date variable "ds"
    y = value)  # Have to name our time series "y"

prophet_data <- train %>% 
    rename(ds = date, # Have to name our date variable "ds"
    y = value)  # Have to name our time series "y"

prophet_data_test <- test %>% 
    rename(ds = date, # Have to name our date variable "ds"
    y = value)  # Have to name our time series "y"


orig_model = prophet(prophet_data) # Initial Model

orig_future = prophet::make_future_dataframe(orig_model, periods = 365) # Create future dataframe for predictions (in days)

orig_forecast = predict(orig_model,orig_future) # Get forecast

plot(orig_model,orig_forecast) + 
  ylab("No. of Crashes") + xlab("Month") + theme_bw()

Just by looking at this forecast, the model does not appear to be doing a good job: the predicted values frequently swing beyond the highest and lowest values observed in the data.
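
One detail worth noting is that make_future_dataframe() defaults to daily frequency, while this series is observed monthly. A monthly future frame is a possible alternative (a sketch only; monthly_future and monthly_forecast are illustrative names, not objects used elsewhere in this report):

Code
# Alternative: 12 monthly future dates instead of 365 daily ones
monthly_future = prophet::make_future_dataframe(orig_model, periods = 12, freq = "month")
monthly_forecast = predict(orig_model, monthly_future)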

Section 2

To study this further, we now conduct a time-series decomposition.

Time-Series Decomposition

Code
prophet::prophet_plot_components(orig_model,orig_forecast)

The trend plot shows that the series moved upward until the end of 2016 before turning downward. It is worth noting that the upward movement was steeper than the subsequent decline.

The yearly plot allows us to interpret possible seasonality. Peaks and troughs recur in certain months, indicating that seasonality may be present in this series, pending further investigation.

Changepoints Detection

To detect critical changepoints in the trend of our time series, we first plot the automatically detected changepoints:

Code
plot(orig_model,orig_forecast) +
  add_changepoints_to_plot(orig_model) + 
  theme_bw() + 
  xlab("Month/Year") + 
  ylab("No. of Crashes")

The plot above shows that the series can be divided into two periods with different trends, separated by a single changepoint at the end of 2016. To verify this, we can allow additional changepoints and assess whether they describe the series more appropriately.

Code

# Allow potential changepoints across 99% of the history (the default range is the first 80%)
model = prophet(prophet_data, changepoint.range = 0.99)

forecast = predict(model,orig_future)

plot(model, forecast) + 
  add_changepoints_to_plot(model) + 
  theme_bw() + 
  xlab("Month/Year") + 
  ylab("No. of Crashes")

To allow changepoints later in the series, the changepoint.range argument was set to 0.99; by default, Prophet only places potential changepoints within the first 80% of the history. Even with this setting, only two changepoints appear in the plot. Visually assessing the trend of our time series, a single changepoint seems more appropriate: the second changepoint adds little to our understanding of the trend.

Given the current information about this series’ trend, the most appropriate choice appears to be a logistic model. Since the trend moves in different directions and at different rates, a linear trend may not describe the series well: the first part moves upward with a steeper slope, while the second part moves downward with a gentler slope. The trend is therefore likely better described by a logistic model.

Saturation Point

While determining a saturation maximum is difficult, the minimum saturation point can confidently be set to zero: crash counts cannot be negative, so a floor of zero keeps the forecasts realistic. For the cap, we look at the highest number of crashes recorded in a single month over the 10-year history, i.e. 3,502. Since it is plausible that this maximum will be exceeded at some point in the future, we set the cap somewhat above it; forecasts beyond 4,500 crashes per month do not seem realistic. Therefore, we define the maximum saturation point as 4,500.

Let’s visualize a one-year Prophet forecast with a logistic trend, a saturation floor at zero, and a cap at 4,500.

Code
# Create a one-year future dataframe (in days)
one_yr_future = make_future_dataframe(orig_model, periods = 365)

# Set saturation floor and cap in the training data (required for logistic growth)
prophet_data$floor <- 0
prophet_data$cap <- 4500

# Set saturation floor and cap in the future dataframes
orig_future$floor <- 0
orig_future$cap <- 4500
one_yr_future$floor <- 0
one_yr_future$cap <- 4500

# Fit the saturated (logistic-growth) model and forecast one year ahead
sat_model <- prophet(prophet_data, growth = "logistic")
sat_one_yr_forecast = predict(sat_model, one_yr_future)

plot(sat_model, sat_one_yr_forecast) +
  ylim(0,4500) +
  theme_bw() +
  xlab("Month/Year") +
  ylab("No. of Crashes")

The plot above is slightly better than our initial model because it respects the highs and lows of the data. However, the high variation persists, and the forecasts approach the boundaries too often. This will need to be investigated further.

Section 3

Earlier, our Prophet model showed some seasonality at the yearly level. To investigate this, we will assess additive and multiplicative seasonality for our data.
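
Loosely speaking, the two modes differ in how the seasonal component enters the model (a standard formulation, stated here for reference):

Additive: y(t) = g(t) + s(t) + ε(t)
Multiplicative: y(t) = g(t) × (1 + s(t)) + ε(t)

In the multiplicative case, the size of the seasonal swings scales with the level of the trend, whereas in the additive case it stays roughly constant.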

Additive Seasonality

Code
additive <- prophet(prophet_data)
add_fcst <- predict(additive, orig_future)

# plot(additive, add_fcst) +
#   ylim(0,4500)

prophet_plot_components(additive,add_fcst)

Looking at the additive seasonality plot, seasonality does appear to be present: some months are clearly higher or lower than others within a year. However, we need to compare this with multiplicative seasonality to determine which best describes our time series.

Multiplicative Seasonality

Code
multi <- prophet(prophet_data, seasonality.mode = 'multiplicative')
multi_fcst <- predict(multi,orig_future)

# plot(multi, multi_fcst) + 
#   ylim(0,4500)

prophet_plot_components(multi, multi_fcst)

Time series with multiplicative seasonality usually contain peaks and troughs whose size changes with the level of the series. Looking at the plot above, this is not the case here: the peaks and troughs are of similar size over time, so we assume the seasonality of this series is additive.

Final Prophet Model

Code
prophet_data$floor <- 0
prophet_data$cap <- 4500

final_model = prophet(prophet_data, 
                      seasonality.mode = "additive",
                      growth = "logistic",
                      n.changepoints = 1,
                      fit = FALSE)
final_model = fit.prophet(final_model,prophet_data)

final_future = prophet::make_future_dataframe(final_model, periods = 730) 

final_future$floor <- 0
final_future$cap <- 4500

final_forecast = predict(final_model, final_future)

##############
forecast_plot_data = final_forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds >= ymd("2021-01-01"))

ggplot() +
  geom_line(aes(prophet_data_test$ds,prophet_data_test$y)) +
  geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat),color='red') +
  theme_bw() +
  ylab('Actual vs Predicted (Red)') +
  xlab('Month/Year')

The final Prophet model has been defined with additive seasonality, logistic growth, one changepoint, and no holidays. The plot compares the forecast (red) with the actual values in the test dataset. Visually comparing the two, the final model seems to show too much variation, so we will assess the evaluation metrics to see where the model can be adjusted.

Section 4

Assessing Model Performance

Out of Sample RMSE/MAE/MAPE

Code
forecast_metric_data = final_forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds >= ymd("2021-01-01"))


RMSE <- sqrt(mean((prophet_data_test$y - forecast_metric_data$yhat)^2))

MAE <- mean(abs(prophet_data_test$y - forecast_metric_data$yhat))

MAPE <- mean(abs((prophet_data_test$y - forecast_metric_data$yhat)/prophet_data_test$y))

print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 1182.11"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 916.88"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.35"
Metric Value
RMSE 1182.1053518
MAE 916.8826464
MAPE 0.3501071

Using the test set (January 2021 to December 2022), we obtain the evaluation metrics above. Although there is no established benchmark for these values, lower values indicate a better model. At first glance, RMSE and MAE appear to be on the higher side, but the MAPE value is still encouraging. These values are best judged by comparing them against alternative models.
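
Because final_future was generated at daily frequency while the test set is monthly, the element-wise arithmetic above relies on the forecast and actual vectors lining up. A more defensive way to compute the same metrics (a sketch only; eval_df is an illustrative name, and both ds columns are assumed to be of class Date) is to join the forecast to the test set on the date first:

Code
# Align forecast and actuals explicitly on the date before computing errors
eval_df <- prophet_data_test %>%
  inner_join(forecast_metric_data %>% select(ds, yhat), by = "ds")

RMSE <- sqrt(mean((eval_df$y - eval_df$yhat)^2))
MAE <- mean(abs(eval_df$y - eval_df$yhat))
MAPE <- mean(abs((eval_df$y - eval_df$yhat) / eval_df$y))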

Rolling Window Cross Validation

Code
df.cv <- cross_validation(final_model, initial = 2525, period = 365, horizon = 365, 
                          units = 'days')

metrics_tbl <- performance_metrics(df.cv)

metrics_tbl %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
horizon mse rmse mae mape mdape smape coverage
30 days 338439.1 581.7552 581.7552 0.2230657 0.2230657 0.2006829 0
61 days 183708.3 428.6120 428.6120 0.1571735 0.1571735 0.1457217 1
90 days 2353079.1 1533.9749 1533.9749 0.8377799 0.8377799 0.5904474 0
121 days 4227106.9 2055.9929 2055.9929 1.6024886 1.6024886 0.8896565 0
151 days 2046809.2 1430.6674 1430.6674 0.6855138 0.6855138 0.5105271 0
182 days 514143.4 717.0379 717.0379 0.2891282 0.2891282 0.2526099 0
212 days 150655.7 388.1440 388.1440 0.1381295 0.1381295 0.1292059 1
243 days 674335.1 821.1791 821.1791 0.3216526 0.3216526 0.2770893 0
274 days 781876.3 884.2377 884.2377 0.3410095 0.3410095 0.2913354 0
304 days 341150.6 584.0810 584.0810 0.1944344 0.1944344 0.1772069 0
335 days 725571.5 851.8049 851.8049 0.3258626 0.3258626 0.2802080 0
365 days 779909.0 883.1246 883.1246 0.3631269 0.3631269 0.3073275 0
Code

plt1 <- plot_cross_validation_metric(df.cv, metric = 'rmse', rolling_window = 0.25) +
  ylim(0,4500)

plt2 <- plot_cross_validation_metric(df.cv, metric = 'mape', rolling_window = 0.25) +
  ylim(0,10)

# Arrange the two ggplots side by side (patchwork)
plt1 + plt2

Both the MAPE and RMSE visuals suggest that the model strikes a reasonable balance (neither clearly over-fitting nor under-fitting), although the errors vary considerably across forecast horizons.
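
For reference, initial = 2525 days corresponds to a training window of roughly seven years; each cut-off marks the end of a training window, and the model is re-fit at every cut-off and evaluated over the following 365 days. The cut-off dates that were actually used can be inspected directly from the df.cv object created above (a quick check, not part of the original report):

Code
# List the cut-off dates used by cross_validation()
unique(df.cv$cutoff)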

Model Comparison

In this section, we’ll compare prophet models across three different factors:

1) Seasonality: Additive or Multiplicative

2) Growth: Linear or Logistic

3) Changepoints: One or More

Additive vs Multiplicative

Code
add_model = prophet(prophet_data, seasonality.mode = 'additive')
df_cv1 <- cross_validation(add_model, initial = 2525, period = 365, 
                           horizon = 365, units = 'days')
metrics1 = performance_metrics(df_cv1) %>% 
  mutate(model = 'Additive')

mult_model = prophet(prophet_data, seasonality.mode = 'multiplicative')
df_cv2 <- cross_validation(mult_model, initial = 2525, period = 365, 
                           horizon = 365, units = 'days')
metrics2 = performance_metrics(df_cv2) %>% 
  mutate(model = "Multiplicative")

#Plots 
plt1 <- metrics1 %>%
  bind_rows(metrics2) %>%
  ggplot( ) +
  geom_line(aes(horizon, rmse, color = model)) +
  theme(legend.position = "top")

plt2 <- metrics1 %>%
  bind_rows(metrics2) %>%
  ggplot() +
  geom_line(aes(horizon, mape, color = model)) +
  theme(legend.position = "top")

# Arrange the two ggplots side by side (patchwork)
plt1 + plt2

Both RMSE and MAPE are lower with additive seasonality, so our choice of additive seasonality for the final model in Section 3 remains unchanged.

Linear vs Logistic

Code
lin_model = prophet(prophet_data, seasonality.mode = 'additive', 
                    growth = 'linear')
df_cv1 <- cross_validation(lin_model, initial = 2525, period = 365, 
                           horizon = 365, units = 'days')
metrics1 = performance_metrics(df_cv1) %>% 
  mutate(model = 'Linear')

logis_model = prophet(prophet_data, seasonality.mode = 'additive', 
                     growth = 'logistic')
df_cv2 <- cross_validation(logis_model, initial = 2525, period = 365, 
                           horizon = 365, units = 'days')
metrics2 = performance_metrics(df_cv2) %>% 
  mutate(model = "Logistic")

plt1 <- metrics1 %>%
  bind_rows(metrics2) %>%
  ggplot( ) +
  geom_line(aes(horizon, rmse, color = model)) +
  theme(legend.position = "top")

plt2 <- metrics1 %>%
  bind_rows(metrics2) %>%
  ggplot() +
  geom_line(aes(horizon, mape, color = model)) +
  theme(legend.position = "top")

# Arrange the two ggplots side by side (patchwork)
plt1 + plt2

Having settled on additive seasonality, we now compare whether linear or logistic growth describes the series better. Looking at the plots, both RMSE and MAPE clearly favor the linear growth model. In Section 3 we opted for logistic growth, but based on these evaluation metrics the best model will be updated to use linear growth.

1-Changepoint vs 2-Changepoints

Code
cp1_model = prophet(prophet_data, seasonality.mode = 'additive', 
                    growth = 'linear', n.changepoints = 1)
df_cv1 <- cross_validation(cp1_model, initial = 2525, period = 365, 
                           horizon = 365, units = 'days')
metrics1 = performance_metrics(df_cv1) %>% 
  mutate(model = 'One CP')

cp2_model = prophet(prophet_data, seasonality.mode = 'additive', 
                     growth = 'linear', n.changepoints = 2)
df_cv2 <- cross_validation(cp2_model, initial = 2525, period = 365, 
                           horizon = 365, units = 'days')
metrics2 = performance_metrics(df_cv2) %>% 
  mutate(model = "Two CP")

plt1 <- metrics1 %>%
  bind_rows(metrics2) %>%
  ggplot( ) +
  geom_line(aes(horizon, rmse, color = model)) +
  theme(legend.position = "top")

plt2 <- metrics1 %>%
  bind_rows(metrics2) %>%
  ggplot() +
  geom_line(aes(horizon, mape, color = model)) +
  theme(legend.position = "top")

# Arrange the two ggplots side by side (patchwork)
plt1 + plt2

Earlier we selected only one changepoint, and that choice is supported by the plots above: both evaluation metrics perform better across the horizon with a single changepoint in the model.

Six-month forecast with Best-fit Prophet Model

Below is the six-month forecast from the Prophet model with additive seasonality, linear growth, and one changepoint.

Code
prophet_data_full$floor <- 0
prophet_data_full$cap <- 4500

best_model = prophet(prophet_data_full, 
                      seasonality.mode = "additive",
                      growth = "linear",
                      n.changepoints = 1,
                      fit = FALSE)
best_model = fit.prophet(best_model, prophet_data_full)

best_future = prophet::make_future_dataframe(best_model, periods = 180) 

best_future$floor <- 0
best_future$cap <- 4500

best_forecast = predict(best_model, best_future)

#Plot
plot(best_model,best_forecast) + 
  ylab("No. of Crashes") + xlab("Month/Year") + theme_bw()

This forecast is better than the one charted earlier in Section 3. The forecasts remain well within the saturation points we defined, and the variation that existed initially has settled down (the forecast is less wobbly). Compared with the historical trend, the forecast looks more believable and realistic. One thing that could still be improved is the fluctuation within the six-month window, which is a consideration for future forecasting models.