First, split your time series into a training and a test set, such that you are training on approximately 80% of the data and testing on approximately 20% of the data. Visualize the training and test sets in a single plot.
Code
# Set the seed for reproducibility
set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets
train_data <- vehicle_sales_tbl_ts %>%
  slice(1:train_rows)
test_data <- vehicle_sales_tbl_ts %>%
  slice((train_rows + 1):n_rows)

# Visualize the training and test sets
ggplot() +
  geom_line(data = train_data, aes(x = date, y = value, color = "Training"), linewidth = 1) +
  geom_line(data = test_data, aes(x = date, y = value, color = "Test"), linewidth = 1) +
  scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
  labs(title = "Training and Test Sets", x = "Date", y = "Vehicle Sales") +
  theme_minimal()
Does it appear that the test set is representative of the training set?
Yes. The plot shows that the test set closely mirrors the trend observed in the training set, which indicates that the test data captures patterns and dynamics similar to those in the training data and is therefore representative of the broader dataset.
Section 2 - Cross-Validation Scheme
Next, set up a rolling window cross-validation scheme using stretch_tsibble. Be sure to make appropriate choices on the initial training period and the interval at which you will step through the data considering the length of your time series.
Code
# Rolling-window CV: begin with the first 22 years (22 * 12 monthly observations)
# as the initial training window and step forward one year (12 observations) at a time
vehicle_train_cv <- train_data %>%
  stretch_tsibble(.init = 22 * 12, .step = 12)

# Visualize which observations are included in each CV iteration
vehicle_train_cv %>%
  ggplot() +
  geom_point(aes(date, factor(.id), color = factor(.id))) +
  ylab('Iteration') +
  ggtitle('Samples included in each CV Iteration')
Section 3 - Model Selection and Comparison
Fit the selected best ARIMA from Assignment 3 and a naive model to each fold in the cross-validation scheme you created. Then, produce a forecast of each model for each fold. Visualize the actual versus predicted for each cross-validation iteration. Does it seem like one model is likely to outperform the other?
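A minimal sketch of how the cross-validation forecasts (`vehicle_train_cv_forecast`) used in the next chunks might be produced is shown below. The response column name `vehicle_sales`, the 12-month forecast horizon, and the ARIMA(2, 0, 2) order (carried over from the Section 4 refit) are assumptions, not the original specification.
Code
# Sketch: fit the candidate ARIMA and a naive benchmark to every CV fold,
# then forecast 12 months ahead for each fold (horizon is an assumption)
vehicle_train_cv_forecast <- vehicle_train_cv %>%
  model(
    arima = ARIMA(vehicle_sales ~ pdq(2, 0, 2)),  # order assumed from the Section 4 refit
    naive = NAIVE(vehicle_sales)
  ) %>%
  forecast(h = 12)

# Actual versus predicted for each CV iteration; the actuals layer has no .id,
# so ggplot repeats it in every facet
vehicle_train_cv_forecast %>%
  as_tibble() %>%
  ggplot(aes(x = date)) +
  geom_line(data = as_tibble(train_data), aes(y = vehicle_sales), color = "grey40") +
  geom_line(aes(y = .mean, color = .model)) +
  facet_wrap(~ .id) +
  labs(title = "Actual vs Predicted by CV Iteration", x = "Date", y = "Vehicle Sales") +
  theme_minimal()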
# Average RMSE at each forecast horizon (months ahead) for each model across CV folds
vehicle_train_cv_forecast %>%
  group_by(.id, .model) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "value", distribution = value) %>%
  accuracy(train_data, by = c("h", ".model")) %>%
  ggplot(aes(x = h, y = RMSE, color = .model)) +
  geom_point() +
  geom_line() +
  ylab('Average RMSE at Forecasting Intervals') +
  xlab('Months in the Future')
Code
# Average MAPE at each forecast horizon, rescaled to a proportion for percent labels
vehicle_train_cv_forecast %>%
  group_by(.id, .model) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "value", distribution = value) %>%
  accuracy(train_data, by = c("h", ".model")) %>%
  mutate(MAPE = MAPE / 100) %>%  # Rescale
  ggplot(aes(x = h, y = MAPE, color = .model)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  scale_y_continuous(name = 'Average MAPE at Forecasting Intervals', labels = scales::percent)
In a training set spanning roughly 38 years (460 monthly observations) of vehicle sales, the ARIMA model outperforms the naive model in predictive accuracy. By modeling the autocorrelation structure of the series rather than simply carrying the last value forward, it produces consistently lower errors across the cross-validation folds.
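The aggregate accuracy table below summarizes the same comparison over all folds. One way it could be computed (a sketch, assuming `vehicle_train_cv_forecast` exists as above) is to score each fold's forecast against the training data and average the per-fold metrics by model:
Code
# Sketch: per-fold accuracy, then averaged across folds for each model
vehicle_train_cv_forecast %>%
  accuracy(train_data) %>%                          # one row of metrics per fold and model
  group_by(.model, .type) %>%
  summarise(across(ME:ACF1, ~ mean(.x, na.rm = TRUE)), .groups = "drop")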
.model   .type         ME       RMSE        MAE        MPE       MAPE       MASE      RMSSE       ACF1
arima    Test   -16.43561   91.05246   72.04965  -1.942046   5.877184  0.6443837  0.6161455  0.5334158
naive    Test   -27.22996  190.11396  152.43048  -4.486380  12.564695  1.3632781  1.2864876  0.2775713
The plots show that the ARIMA model outperforms the naive model at every forecast horizon, and the accuracy table supports the same conclusion: the ARIMA model's RMSE and MAPE are roughly half those of the naive benchmark.
Section 4 - Test Set Forecast and Evaluation
After identifying the model that performed the best in Section 3, refit that model to the entire training set and produce a forecast for the test set. Visualize the actual versus predicted for the test set, and recalculate your performance metrics on the test set for this selected model.
While the forecast captures the variation and trend of the series, it falls short on magnitude: the model tracks the overall pattern well but tends to underestimate the actual values, which hurts its performance on the test set.
Code
# Fit the ARIMA model to the training data
vehicle_model <- train_data %>%
  model(ARIMA(vehicle_sales ~ pdq(2, 0, 2)))

# Extract the fitted values from the model
train_fitted <- vehicle_model %>%
  augment()

# Convert date variable to Date class if it's not already
train_fitted$date <- as.Date(train_fitted$date)
train_data$date <- as.Date(train_data$date)

# Plot the fitted values against the actual values
ggplot() +
  geom_line(data = train_data, aes(x = date, y = vehicle_sales),
            color = "blue", linetype = "solid", linewidth = 1, alpha = 0.7) +
  geom_line(data = train_fitted, aes(x = date, y = .fitted),
            color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Actual vs Fitted Values for Training Set", y = "Vehicle Sales", x = "Date") +
  theme_minimal()
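A sketch of the remaining Section 4 steps follows: forecasting over the test horizon, overlaying actual versus predicted, and recomputing the accuracy metrics on the test set. The object names `vehicle_model`, `test_data`, and `vehicle_sales_tbl_ts` come from the chunks above; everything else is an assumption rather than the original code.
Code
# Forecast the fitted model over the test period
vehicle_test_forecast <- vehicle_model %>%
  forecast(new_data = test_data)

# Actual vs predicted on the test set
vehicle_test_forecast %>%
  autoplot(test_data) +
  labs(title = "Test Set: Actual vs Forecast", x = "Date", y = "Vehicle Sales") +
  theme_minimal()

# Test-set performance metrics; the full series is supplied so scaled
# metrics (MASE, RMSSE) can use the training period for scaling
vehicle_test_forecast %>%
  accuracy(vehicle_sales_tbl_ts)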
The fitted values track the actual training values closely, indicating low bias and a good in-sample fit. Comparing the forecast against the test data, however, shows far less variation in the forecast than in the actuals, and this gap between in-sample and out-of-sample performance points to overfitting. The model has learned the training patterns well but does not generalize cleanly to unseen data, so further investigation and adjustment would be warranted before relying on it.