Section 1 - Exploratory Data Analysis and Time Series Decomposition
Data Source and Description
Code
# Libraries used throughout the report
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(tsibble)
library(zoo)
library(fable)
library(feasts)
library(fable.prophet)

# Read the raw TOTALNSA data
vehicle_df <- read.csv('/Users/lasyapinnamaneni/Desktop/MS BANA/Time series/total_vehicle_sales.csv')
# head(vehicle_df)
# str(vehicle_df)

# Convert the date column to a year-month index and build a tsibble
vehicle_df <- vehicle_df %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)
# vehicle_df$date <- ymd(vehicle_df$date)

# Set the seed for reproducibility
set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_df)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets (chronological split)
vehicle_train <- vehicle_df %>% slice(1:train_rows)
vehicle_test  <- vehicle_df %>% slice((train_rows + 1):n_rows)

# Visualize the training and test sets
ggplot() +
  geom_line(data = vehicle_train, aes(x = date, y = vehicle_sales, color = "Training"), linewidth = 1) +
  geom_line(data = vehicle_test, aes(x = date, y = vehicle_sales, color = "Test"), linewidth = 1) +
  scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
  labs(title = "Training and Test Sets", x = "Date", y = "vehicle sales") +
  theme_minimal()
The dataset represents the total monthly vehicle sales in the USA (“TOTALNSA” series) published by the Federal Reserve Bank of St. Louis (FRED). This time series data is subject to monthly updates and serves as a pivotal economic indicator, capturing variations in consumer spending patterns and trends within the automotive industry. Spanning multiple years, the dataset provides a comprehensive perspective on the dynamics of vehicle sales, making it a valuable resource for analyzing long-term trends and patterns in the market.
Summary Statistics
Code
# Quick summary of the training series (not used in the table below)
summary_stats <- summary(vehicle_train$vehicle_sales)

# Build a summary table for the full series (vehicle_df)
# Note: the "Mode" entry returns the frequency of the most common value from
# table(), not the modal sales value itself; with continuous data this is of
# limited use.
summary_table <- data.frame(
  Metric = c("Number of Observations", "Mean", "Median", "Mode",
             "Standard Deviation", "Range", "Min", "Max"),
  Value = c(
    length(vehicle_df$vehicle_sales),
    mean(vehicle_df$vehicle_sales),
    median(vehicle_df$vehicle_sales),
    table(vehicle_df$vehicle_sales)[which.max(table(vehicle_df$vehicle_sales))],
    sd(vehicle_df$vehicle_sales),
    diff(range(vehicle_df$vehicle_sales)),
    min(vehicle_df$vehicle_sales),
    max(vehicle_df$vehicle_sales)
  )
)

# Print the table
print(summary_table)
Metric Value
1 Number of Observations 575.0000
2 Mean 1261.7437
3 Median 1268.4550
4 Mode 2.0000
5 Standard Deviation 221.9962
6 Range 1175.2470
7 Min 670.4660
8 Max 1845.7130
Distribution of Time Series
Code
# Histogram
histogram <- ggplot(vehicle_train, aes(x = vehicle_sales)) +
  geom_histogram(fill = "skyblue", color = "blue", bins = 20) +
  labs(title = "Histogram of Avg vehicle sales", x = "vehicle sales", y = "Frequency")

# Density plot
density <- ggplot(vehicle_train, aes(x = vehicle_sales)) +
  geom_density(fill = "skyblue", color = "blue") +
  labs(title = "Density Plot of Avg vehicle sales", x = "vehicle sales", y = "Density")

# Boxplot
boxplot <- ggplot(vehicle_train, aes(y = vehicle_sales)) +
  geom_boxplot(fill = "skyblue", color = "blue") +
  labs(title = "Boxplot of Avg vehicle sales", x = "", y = "vehicle sales")

# Violin plot
violin <- ggplot(vehicle_train, aes(x = "", y = vehicle_sales)) +  # specify x aesthetic
  geom_violin(fill = "skyblue", color = "blue") +
  labs(title = "Violin Plot of Avg vehicle sales", x = "", y = "vehicle sales")

# Arrange plots in a 2x2 grid
ggarrange(histogram, density, boxplot, violin, ncol = 2, nrow = 2)
The vehicle sales appear to be fairly symmetrically distributed, with no outliers in the data set.
The density curve of vehicle sales shows a bimodal pattern with two distinct peaks. The distribution also appears symmetric, suggesting that the mean and median are approximately equal.
The histogram of vehicle sales likewise reveals a bimodal distribution with two peaks.
The boxplot detects no outliers and confirms the symmetric shape of the distribution.
Moving Average of Time Series
Code
# Moving averages of order 3, 6, 12, and 24
vehicle_ma <- vehicle_train %>%
  mutate(
    vehicle_ma_03 = zoo::rollmean(vehicle_sales, k = 3,  fill = NA),
    vehicle_ma_06 = zoo::rollmean(vehicle_sales, k = 6,  fill = NA),
    vehicle_ma_12 = zoo::rollmean(vehicle_sales, k = 12, fill = NA),
    vehicle_ma_24 = zoo::rollmean(vehicle_sales, k = 24, fill = NA)
  )

# Pivot to long format so all MA orders can be plotted together
vehicle_ma_pivot <- vehicle_ma %>%
  pivot_longer(
    cols = vehicle_ma_03:vehicle_ma_24,
    values_to = "value_ma",
    names_to = "ma_order"
  ) %>%
  mutate(
    ma_order = factor(
      ma_order,
      levels = c("vehicle_ma_03", "vehicle_ma_06", "vehicle_ma_12", "vehicle_ma_24")
    )
  )

vehicle_ma_pivot %>%
  mutate(
    ma_order = case_when(
      ma_order == 'vehicle_ma_03' ~ '03rd Order',
      ma_order == 'vehicle_ma_06' ~ '06th Order',
      ma_order == 'vehicle_ma_12' ~ '12th Order',
      ma_order == 'vehicle_ma_24' ~ '24th Order'
    )
  ) %>%
  ggplot() +
  geom_line(aes(date, vehicle_sales), linewidth = 1) +
  geom_line(aes(date, value_ma, color = ma_order), linewidth = 1) +
  scale_color_discrete(name = 'MA Order') +
  theme_bw() +
  ylab('vehicle sales') +
  xlab("Time")
Code
# Overlay the 24th-order moving average on the raw series
vehicle_ma %>%
  ggplot() +
  ggtitle("Twenty Fourth Order Moving Average Estimates") +
  geom_line(aes(date, vehicle_sales), linewidth = 1) +
  geom_line(aes(date, vehicle_ma_24, color = '24th order'), linewidth = 1) +
  scale_color_discrete(name = 'MA Estimates') +
  theme_bw() +
  ylab('vehicle sales') +
  xlab("Time")
The moving average (MA) calculated over a 24-month period smooths out the seasonal fluctuations and captures the underlying trend.
After applying STL decomposition to the dataset, a clear seasonal pattern remains even after the overall trend is removed. Specifically, we identified a recurring yearly cycle: a peak followed by a trough, then a 2-3 month stretch of elevated activity, and another downturn before the next seasonal cycle begins. Given the timing of these movements, it is reasonable to conclude that this time series is heavily influenced by the financial year and holiday periods.
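The STL decomposition referred to above is not reproduced in this section; a minimal sketch of how it could be obtained with the feasts package on the training tsibble (default STL settings assumed) is:

# STL decomposition of the training series into trend, seasonal, and remainder
vehicle_train %>%
  model(stl = STL(vehicle_sales)) %>%
  components() %>%
  autoplot()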
Section 2 - ARIMA Modeling
Rolling SD of Differenced Series
Code
# First difference of the training series
vehicle_diff <- vehicle_train %>%
  mutate(value_diff = vehicle_sales - lag(vehicle_sales)) %>%
  as_tsibble(index = date)

# Rolling 12-month standard deviation of the differenced series
vehicle_diff %>%
  mutate(diff_sd = zoo::rollapply(value_diff, FUN = sd, width = 12, fill = NA)) %>%
  ggplot() +
  geom_line(aes(date, diff_sd)) +
  geom_smooth(aes(date, diff_sd), method = 'lm', se = FALSE) +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'pink', color = 'brown')
  ) +
  ggtitle("Std Dev of Differenced vehicle sales over Time") +
  ylab("Std Dev of Differenced vehicle sales") +
  xlab("Year")
Although the rolling standard deviation plot shows a slight decreasing trend in variance over time, we uphold our conclusion that no correction for variance non-stationarity is needed. This decision is based on the slow rate of change of the trend and the considerable year-to-year variation.
Seasonal Differencing
Given the presence of seasonality in the vehicle sales data, our next step involves examining the seasonal variation. We apply a seasonal difference at lag 12 (one year), followed by a first (lag-1) difference, since the series remained non-stationary after the one-year difference alone.
Testing the original sales data yielded a very small p-value, prompting us to reject the hypothesis of mean stationarity. After differencing to remove seasonality and trend, however, the KPSS test produced a p-value of 0.10, above our 0.05 threshold. This outcome is consistent with the transformed series being mean-stationary, indicating that no additional transformations are needed.
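The differencing and KPSS testing described above can be sketched with tsibble and feasts as follows; the lag-12 seasonal difference followed by a lag-1 difference, and the column name sales_diff, are assumptions based on the description:

# Seasonal (lag-12) difference followed by a first difference
vehicle_sd <- vehicle_train %>%
  mutate(sales_diff = difference(difference(vehicle_sales, lag = 12), lag = 1))

# KPSS test (null hypothesis: the series is mean-stationary)
vehicle_train %>%
  features(vehicle_sales, unitroot_kpss)    # original series

vehicle_sd %>%
  filter(!is.na(sales_diff)) %>%
  features(sales_diff, unitroot_kpss)       # differenced series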
The ACF plot suggests a moving average (MA) order of 2, and the last significant lag in the PACF plot is also 2, suggesting an autoregressive (AR) order of 2.
Based on the ACF/PACF plots and the stationarity of the differenced series, ARIMA(2,0,2) is a reasonable candidate model for the differenced data.
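The ACF and PACF plots themselves are not reproduced here; one way to generate them with feasts, assuming the doubly differenced sales_diff series from the sketch above, is:

# Time plot, ACF, and PACF of the doubly differenced series
vehicle_sd %>%
  filter(!is.na(sales_diff)) %>%
  gg_tsdisplay(sales_diff, plot_type = "partial")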
The Box test yields a p-value below the 0.05 significance level, so we reject the null hypothesis of no autocorrelation: the series is not white noise.
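A sketch of how the candidate model and its residual diagnostics could be run with fable and feasts follows. Folding the seasonal and first differences into the specification (pdq(2,1,2) with PDQ(0,1,0)) corresponds to ARIMA(2,0,2) on the doubly differenced series; this encoding, and applying the Ljung-Box test to the residuals, are assumptions rather than the report's exact steps:

# Candidate model with the identified orders; the differences are expressed
# through the specification rather than applied to the data beforehand
arima_fit <- vehicle_train %>%
  model(arima = ARIMA(vehicle_sales ~ pdq(2, 1, 2) + PDQ(0, 1, 0)))

report(arima_fit)

# Ljung-Box test on the residuals (null: no remaining autocorrelation)
arima_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)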
Section 3 - Meta Prophet Model
Decomposing Using Prophet Model
Code
# Fit a Prophet model to the training data and plot its components
fit <- vehicle_train %>%
  model(prophet = fable.prophet::prophet(vehicle_sales))

components(fit) %>%
  autoplot()
The decomposition of the time series reveals an upward trend until the early 2000s, declining thereafter. Additionally, the seasonal component appears additive, with no indication of multiplicative seasonality.
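The accuracy table below compares an ARIMA model, a seasonal naive benchmark, and the Prophet model on the held-out test set. The report's own comparison code is not shown in this section; a minimal sketch of how such a table could be produced with fabletools follows (the SNAIVE benchmark and automatic ARIMA order selection are assumptions):

# Fit the three candidate models on the training set
fits <- vehicle_train %>%
  model(
    arima               = ARIMA(vehicle_sales),
    naive_w_seasonality = SNAIVE(vehicle_sales),
    prophet             = fable.prophet::prophet(vehicle_sales)
  )

# Forecast over the test window and score against the held-out observations;
# the full series is supplied so MASE/RMSSE can be scaled by the training data
fits %>%
  forecast(h = nrow(vehicle_test)) %>%
  accuracy(vehicle_df)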
   .model              .type         ME     RMSE       MAE        MPE      MAPE      MASE     RMSSE      ACF1
1: arima               Test    2.135675 114.6195  85.49483 -0.4589264  7.057158 0.7646321 0.7756221 0.4196724
2: naive_w_seasonality Test   14.625080 144.8972 108.19676  0.5234692  9.137156 0.9676692 0.9805090 0.5811537
3: prophet             Test   -8.354547 202.8373 163.31542 -1.8428421 14.002879 1.4606287 1.3725858 0.7982903
The ARIMA model achieved the lowest RMSE of 114.6195 on the test data, indicating that, on average, its predictions deviate by about 114.6 units from the actual values in the test set.
Below, we check the RMSE of all models at forecast horizons T+1, T+2, and so on.
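The cross-validation code itself is not shown in this section; below is a rough sketch of one way to compute RMSE by forecast horizon with rolling-origin evaluation, where the window sizes and the months_ahead column name are illustrative choices rather than the report's exact setup:

# Expanding training windows for rolling-origin evaluation
cv_fc <- vehicle_train %>%
  stretch_tsibble(.init = 120, .step = 12) %>%
  model(arima = ARIMA(vehicle_sales)) %>%
  forecast(h = 12)

# RMSE by forecast horizon (T+1, T+2, ...)
cv_fc %>%
  as_tibble() %>%
  group_by(.id) %>%
  mutate(months_ahead = row_number()) %>%
  ungroup() %>%
  left_join(as_tibble(vehicle_df) %>% select(date, actual = vehicle_sales),
            by = "date") %>%
  group_by(months_ahead) %>%
  summarise(rmse = sqrt(mean((actual - .mean)^2, na.rm = TRUE)))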
While the forecast model effectively captures the variation and trend of the data, it falls short in capturing its magnitude. This suggests that while the model tracks the overall patterns well, it may underestimate the actual values, which hurts its performance on the test set. The ARIMA model has the lowest error among all the models, whether measured by RMSE or MAPE.
12-Month Forecast on the Complete Data Using the Best Model
To gauge the model’s fit, we assessed its bias by examining how closely the model’s forecasts align with the training data. Plotting actual versus forecasted values revealed a pronounced bias within the training dataset. Furthermore, comparison with the test data showed limited variance in the forecasts, suggesting underfitting.
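As a closing sketch, the 12-month forecast described above could be produced by refitting the best-performing specification on the complete series with fable; the automatic order selection and the plot labels here are assumptions:

# Refit the ARIMA specification on the complete series and forecast one year ahead
final_fit <- vehicle_df %>%
  model(arima = ARIMA(vehicle_sales))

final_fc <- final_fit %>%
  forecast(h = "12 months")

# Plot the forecast against the historical data
final_fc %>%
  autoplot(vehicle_df) +
  labs(title = "12-Month Vehicle Sales Forecast", x = "Date", y = "vehicle sales")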