The dataset contains monthly vehicle sales records from January 1976 to November 2023, a total of 575 rows, one per month. Each date is the first day of a month. Since the total sales for a month are typically not available until the month ends, I infer that each figure refers to the preceding month's sales. It is also unclear whether the figures belong to a particular dealer, an OEM, or a geographical region. In my experience working in the automotive sector, vehicle sales are driven mainly by two factors: seasonality and the state of the economy. This dataset supports that view, since sales decline whenever there is an economic slowdown. Predicting this variable is challenging because of its susceptibility to external disruptions and the evolving dynamics of the automotive industry. The industry's reliance on external inputs such as raw materials, technology, and spare parts adds to the complexity. Historical patterns and market research improve forecasting accuracy, but the wide range of influences makes this an interesting yet difficult variable to anticipate.
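The row count can be checked directly from the date range. The short sketch below, in base R, builds the sequence of first-of-month dates and counts them. It also shows one way to relabel each stamp with the month it would refer to under the "preceding month" reading, which is an inference rather than something the data confirms. The variable names are illustrative.

```r
# Monthly first-of-month dates from January 1976 through November 2023.
months <- seq(as.Date("1976-01-01"), as.Date("2023-11-01"), by = "month")
n_months <- length(months)
print(n_months)

# Assumption: a stamp on the first of a month reports the previous month's total.
# Stepping back one day lands in that previous month; keep only year and month.
reference_month <- format(months - 1, "%Y-%m")
print(head(reference_month, 3))
```

The count matches the 575 observations reported by `str(data)`.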
------
summary(data)
date vehicle_sales
Length:575 Min. : 670.5
Class :character 1st Qu.:1117.2
Mode :character Median :1268.5
Mean :1261.7
3rd Qu.:1420.2
Max. :1845.7
str(data)
'data.frame': 575 obs. of 2 variables:
$ date : chr "1976-01-01" "1976-02-01" "1976-03-01" "1976-04-01" ...
$ vehicle_sales: num 885 995 1244 1191 1203 ...
print('Sd of Vehicle data:')
[1] "Sd of Vehicle data:"
sd(data$vehicle_sales)
[1] 221.9962
date <- as.Date(data$date, format = "%Y-%m-%d")
str(date)
ggplot() +
  geom_line(aes(x = date, y = data$vehicle_sales)) +
  geom_smooth(aes(date, data$vehicle_sales), method = "lm", color = "red") +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")
`geom_smooth()` using formula = 'y ~ x'
---Summary Stats---
The time series of vehicle sales shows distinct seasonality, with consistent peaks and troughs, and considerable volatility: the standard deviation is about 222 against a mean of about 1262. The distribution appears multimodal, pointing to distinct sales patterns or regimes. The mean (1261.7) and median (1268.5) are close, which suggests a roughly symmetric distribution around a central level of sales. Taken together, the seasonal patterns may be predictable, but economic events and policy changes complicate forecasting.
------
boxplot(data$vehicle_sales)
---Boxplot---
The boxplot of vehicle sales shows a relatively tight interquartile range: the central 50% of the data lies between about 1117 and 1420. The whiskers extend well beyond the box on both sides, reflecting periods with notably lower and higher sales. Whiskers mark the range of non-outlying observations, and only points drawn beyond them would be flagged as outliers. The balanced structure of the box and whiskers suggests an approximately symmetric distribution around the median, without a distinct skew.
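Whether any month counts as an outlier under the usual 1.5 × IQR rule can be checked from the quartiles that `summary(data)` reports. The sketch below hard-codes those printed values. Note that `boxplot()` uses hinges, which can differ slightly from these quartiles, so treat the fences as approximate.

```r
# Quartiles and extremes copied from the summary() output above.
q1 <- 1117.2
q3 <- 1420.2
min_sales <- 670.5
max_sales <- 1845.7

# Standard boxplot fences: 1.5 interquartile ranges beyond each quartile.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))

# TRUE only if the minimum or maximum falls outside the fences.
outliers_possible <- min_sales < lower_fence || max_sales > upper_fence
print(outliers_possible)
```

With these inputs the fences are roughly 662.7 and 1874.7, so both the minimum and the maximum fall inside them. The whiskers would then simply reach the extremes of the data.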
------
hist(data$vehicle_sales, freq = FALSE)
lines(density(data$vehicle_sales), lwd = 3, col = "red")
---Histogram---
The histogram depicting vehicle sales reveals a multimodal distribution characterized by multiple peaks.
-------
---Section 2---
From the preliminary exploratory analysis, the dataset consists of two columns: the date, recorded monthly and stored as character, and the corresponding vehicle sales for each month, stored as numeric. Sales range from a minimum of about 670 to a maximum of about 1845, but in most months they lie between 1100 and 1400. The sales figures include decimal values, which suggests the monthly data may have been averaged. The line graph shows significant fluctuations in sales, attributable to seasonality and economic conditions. There are four major dips in vehicle sales, each coinciding with a sharp economic downturn: the early 1980s recession, the early 1990s recession, the global financial crisis (2008), and COVID-19. Additional factors may contribute, but these two appear to be the primary drivers. Fitting a trend line to the plot shows an overall positive trend, indicating a gradual increase in vehicle sales over the observed period.
------
## 6-month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")
# Calculate and add the 6-month moving average
lines(date, zoo::rollmean(data$vehicle_sales, 6, fill = NA), col = "red", lwd = 2)
# Add legend
legend("topright", legend = c("Vehicle Sales", "6-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)
## 12-month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")
# Calculate and add the 12-month moving average
lines(date, zoo::rollmean(data$vehicle_sales, 12, fill = NA), col = "red", lwd = 2)
# Add legend
legend("topright", legend = c("Vehicle Sales", "12-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)
---Moving Average---
Two moving averages have been plotted in the graph, one with a 6-month window and the other with a 12-month window. The 6-month moving average exhibits a higher sensitivity to fluctuations and noise in the time series, making it less suitable for effectively representing the underlying trend. In contrast, the 12-month moving average provides a smoother and more stable representation, making it the preferred choice for capturing the overall trend in the data.
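The 12-month window is smoother because it spans exactly one seasonal cycle, which can be illustrated on a synthetic series. The sketch below uses base R's `stats::filter` instead of `zoo::rollmean`, and a 2x12 centred moving average (half weight on the two end months) so that the even-length window is properly centred. The series, its trend, and the seasonal pattern are made-up assumptions, not the vehicle sales data.

```r
# Synthetic monthly series (assumption): linear trend plus a fixed
# 12-month seasonal pattern whose values sum to zero.
n <- 120
trend <- 1000 + 2 * seq_len(n)
pattern <- c(-50, -30, 20, 40, 60, 30, 0, -10, -20, -40, -20, 20)
sales <- ts(trend + rep(pattern, length.out = n), frequency = 12, start = c(2000, 1))

# 6-term moving average: covers only half a seasonal cycle.
ma6 <- stats::filter(sales, rep(1 / 6, 6), sides = 2)

# 2x12 centred moving average: 13 terms, end months weighted by one half,
# so every calendar month receives a total weight of 1/12.
w_2x12 <- c(0.5, rep(1, 11), 0.5) / 12
ma_2x12 <- stats::filter(sales, w_2x12, sides = 2)

# Largest deviation of each smoother from the true trend.
err6 <- max(abs(as.numeric(ma6) - trend), na.rm = TRUE)
err12 <- max(abs(as.numeric(ma_2x12) - trend), na.rm = TRUE)
print(c(err6 = err6, err12 = err12))
```

On a series like this, the 12-month smoother removes the seasonal pattern entirely, while the 6-month smoother still carries part of it.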
# Calculate the remainder series by subtracting the moving average from the original series
data$Remainder <- data$vehicle_sales - zoo::rollmean(data$vehicle_sales, 12, fill = NA)
# Plot the remainder series (the date column is stored as character, so convert it for the x axis)
ggplot(data, aes(x = as.Date(date), group = 1)) +
  geom_line(aes(y = Remainder), colour = "blue", na.rm = TRUE) +
  labs(title = "Remainder Series after Removing Moving Average", x = "Date", y = "Remainder") +
  theme_minimal()
---Seasonality---
Subtracting the 12-month moving average from the original series leaves the fluctuations that the smoothing removes. Because a 12-month window averages over a full year, this remainder contains the within-year seasonal pattern along with short-term irregularities and noise. Any regular pattern in the remainder that does not repeat annually would hint at additional seasonality with a different period. Irregular spikes or drops may indicate specific events or anomalies in the data.
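One way to read a seasonal pattern out of such a remainder is to average it by calendar month, as in classical decomposition. The sketch below does this in base R on a synthetic series with a known seasonal pattern. The data, noise level, and variable names are assumptions for illustration only.

```r
# Synthetic monthly series (assumption): trend + known seasonal pattern + noise.
set.seed(1)
n <- 120
pattern <- c(-50, -30, 20, 40, 60, 30, 0, -10, -20, -40, -20, 20)
sales <- ts(1000 + 2 * seq_len(n) + rep(pattern, length.out = n) + rnorm(n, sd = 5),
            frequency = 12, start = c(2000, 1))

# Trend estimate: 2x12 centred moving average (NA for the first and last 6 months).
trend_hat <- stats::filter(sales, c(0.5, rep(1, 11), 0.5) / 12, sides = 2)

# Remainder after removing the trend estimate.
remainder <- as.numeric(sales) - as.numeric(trend_hat)

# Average the remainder by calendar month (1 = January, ..., 12 = December).
month_id <- as.integer(cycle(sales))
seasonal_index <- tapply(remainder, month_id, mean, na.rm = TRUE)
print(round(seasonal_index, 1))
```

If the monthly averages differ clearly from one another and repeat year after year, the remainder is dominated by annual seasonality rather than noise.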
------
if ("vehicle_sales" %in% names(data) && "date" %in% names(data) && nrow(data) > 0) {
  # Build a monthly time series starting at the first observed month
  start_date <- min(as.Date(data$date))
  data_ts <- ts(data$vehicle_sales, frequency = 12, start = c(lubridate::year(start_date), lubridate::month(start_date)))
  # STL decomposition
  data_stl <- stl(data_ts, s.window = "periodic")
  # Plot the components
  autoplot(data_stl) + labs(title = "STL Decomposition of Vehicle Sales")
} else {
  stop("The 'vehicle_sales' and/or 'date' column is missing, or there is no data in the dataframe.")
}
---Section 3---
The trend component fluctuates, reflecting shifts in the underlying level of sales over the long term. The seasonal component shows a distinct, consistently repeating pattern, pointing to a robust seasonal influence on sales, and its noticeable amplitude implies that seasonality has a substantial impact. The remainder is relatively small in scale compared with the seasonal fluctuations, which suggests that the STL decomposition has captured most of the systematic behavior in the data. The prominent seasonality matches expectations for vehicle sales, where seasonal patterns are shaped by factors such as new model releases and end-of-year sales, while economic cycles act mainly on the trend. The STL decomposition underscores that seasonality is a significant feature of this time series and must be considered in any forecasting model.
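The visual comparison of component sizes can be made numeric with the strength measures from Hyndman and Athanasopoulos, for example $F_S = \max(0, 1 - \mathrm{Var}(R) / \mathrm{Var}(S + R))$ for seasonality. The sketch below applies them to an STL fit on a synthetic series, since the vehicle sales data is not reproduced here. The series and the variable names are assumptions.

```r
# Synthetic monthly series (assumption): trend + seasonal pattern + noise.
set.seed(42)
n <- 240
pattern <- c(-50, -30, 20, 40, 60, 30, 0, -10, -20, -40, -20, 20)
sales <- ts(1000 + 1.5 * seq_len(n) + rep(pattern, length.out = n) + rnorm(n, sd = 10),
            frequency = 12, start = c(2000, 1))

# STL decomposition with a fixed (periodic) seasonal component.
fit <- stl(sales, s.window = "periodic")
seasonal <- as.numeric(fit$time.series[, "seasonal"])
trend <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# Strength measures in [0, 1]: values near 1 mean the component dominates the remainder.
seasonal_strength <- max(0, 1 - var(remainder) / var(seasonal + remainder))
trend_strength <- max(0, 1 - var(remainder) / var(trend + remainder))
print(round(c(seasonal = seasonal_strength, trend = trend_strength), 3))
```

Running the same calculation on `data_stl$time.series` would give a number to back the statement that the remainder is small relative to the seasonal component.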
------
# Naive seasonal forecast
naive_seasonal_forecast <- snaive(data_ts, h = 6)
# Naive forecast with drift
naive_drift_forecast <- rwf(data_ts, h = 6, drift = TRUE)
# Original time series and the forecasts
autoplot(data_ts) +
  autolayer(naive_seasonal_forecast, series = "Naive Seasonal Forecast", PI = FALSE) +
  autolayer(naive_drift_forecast, series = "Naive Drift Forecast", PI = FALSE) +
  labs(title = "Naive Forecasts for Vehicle Sales", x = "Time", y = "Sales") +
  guides(colour = guide_legend(title = "Legend")) +
  theme_minimal()
---Section 4---
The choice between the naive seasonal forecast and the naive forecast with drift hinges on how visible the trend is in the time series. If a clear and consistent trend is apparent, the drift method is justified. If the trend is ambiguous or the focus is on capturing seasonality, the naive seasonal method is more suitable.
To assess a naive forecast, the forecasted values should be compared with actual values. Actual values for the forecast horizon are not available here, so the alternative is to judge how well the forecast aligns with the known patterns in the historical data. Given the strong seasonality identified above, the naive seasonal forecast is expected to capture the data's behavior well, especially the seasonal component. In addition, if the historical data reflects a relatively stable level or an inconsistent trend, a naive seasonal forecast without drift is likely to be more accurate than the drift method.
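One way to work around the missing actuals, offered here as a suggestion rather than the method used above, is to hold out the last 6 observed months, forecast them from the earlier data, and compare errors. The sketch below implements both naive methods by hand in base R on a synthetic series, so it does not depend on `snaive()` or `rwf()`. The data and all names are assumptions.

```r
# Synthetic monthly series (assumption): mild trend + seasonal pattern + noise.
set.seed(7)
n <- 120
h <- 6
pattern <- c(-50, -30, 20, 40, 60, 30, 0, -10, -20, -40, -20, 20)
sales <- 1000 + 0.5 * seq_len(n) + rep(pattern, length.out = n) + rnorm(n, sd = 5)

# Hold out the last h months as a test set.
train <- sales[1:(n - h)]
test <- sales[(n - h + 1):n]
n_train <- length(train)

# Seasonal naive: each forecast repeats the value observed 12 months earlier.
snaive_fc <- train[n_train - 12 + seq_len(h)]

# Drift: last observed value plus the average historical change per month.
slope <- (train[n_train] - train[1]) / (n_train - 1)
drift_fc <- train[n_train] + slope * seq_len(h)

# Mean absolute error of each method on the held-out months.
mae <- function(actual, forecast) mean(abs(actual - forecast))
results <- c(seasonal_naive = mae(test, snaive_fc), drift = mae(test, drift_fc))
print(round(results, 1))
```

On the real series, the same comparison could be run by fitting `snaive()` and `rwf()` to all but the last 6 months of `data_ts` and scoring them against the months held out.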