Assignment 2 Forecasting

Author

Sai Lohith Roy Valleri

#install.packages("slider")
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(slider)

Warning: package 'slider' was built under R version 4.3.2

library(tidyr)

Warning: package 'tidyr' was built under R version 4.3.2

#install.packages("warp")
library(readxl)

Warning: package 'readxl' was built under R version 4.3.2

library(zoo)


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

library(forecast)

Warning: package 'forecast' was built under R version 4.3.2

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(lubridate)

Warning: package 'lubridate' was built under R version 4.3.2


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

data <- read.csv("total_vehicle_sales.csv", header = TRUE)

head(data, 10)

         date vehicle_sales
1  1976-01-01         885.2
2  1976-02-01         994.7
3  1976-03-01        1243.6
4  1976-04-01        1191.2
5  1976-05-01        1203.2
6  1976-06-01        1254.7
7  1976-07-01        1162.3
8  1976-08-01        1026.1
9  1976-09-01        1057.9
10 1976-10-01        1129.4

tail(data, 10)

          date vehicle_sales
566 2023-02-01      1175.543
567 2023-03-01      1419.927
568 2023-04-01      1400.163
569 2023-05-01      1409.258
570 2023-06-01      1414.980
571 2023-07-01      1339.949
572 2023-08-01      1364.907
573 2023-09-01      1373.627
574 2023-10-01      1237.431
575 2023-11-01      1256.699

---Section 1---

The dataset is a monthly time series of vehicle sales spanning January 1976 to November 2023. It contains 575 rows, one for each month in this period. Each date is the first day of its month. Since total sales for a month are typically unavailable until the month ends, I infer that each figure refers to the preceding month's sales. It is also not clear whether these figures belong to a particular dealer, an OEM, or a geographical region. In my experience working in the automotive sector, vehicle sales are driven mainly by two factors: seasonality and the state of the economy. The dataset is consistent with this view, since sales decline whenever there is an economic slowdown. Predicting this variable is challenging because of its susceptibility to external disruptions and the changing dynamics of the automotive industry. The industry's reliance on external inputs such as raw materials, technology, and spare parts adds to the complexity. Historical patterns and market research improve forecasting accuracy, but the wide range of influences makes this a difficult variable to anticipate.
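As a quick sanity check on the row count, the number of month starts between January 1976 and November 2023 can be counted directly. This is a minimal base R sketch; the names `month_starts` and `n_months` are illustrative and not part of the original analysis.

```r
# Generate every first-of-month date in the stated range (base R only).
month_starts <- seq(as.Date("1976-01-01"), as.Date("2023-11-01"), by = "month")

# Number of monthly observations a gap-free series over this range would have.
n_months <- length(month_starts)
print(n_months)
```

If the series has no missing months, this count should match `nrow(data)`.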

------

summary(data)

     date           vehicle_sales   
 Length:575         Min.   : 670.5  
 Class :character   1st Qu.:1117.2  
 Mode  :character   Median :1268.5  
                    Mean   :1261.7  
                    3rd Qu.:1420.2  
                    Max.   :1845.7

str(data)

'data.frame':   575 obs. of  2 variables:
 $ date         : chr  "1976-01-01" "1976-02-01" "1976-03-01" "1976-04-01" ...
 $ vehicle_sales: num  885 995 1244 1191 1203 ...

print('Sd of Vehicle data:')

[1] "Sd of Vehicle data:"

sd(data$vehicle_sales)

[1] 221.9962

date <- as.Date(data$date, format = "%Y-%m-%d")
str(date)

 Date[1:575], format: "1976-01-01" "1976-02-01" "1976-03-01" "1976-04-01" "1976-05-01" ...

ggplot() +
  geom_line(aes(x = date, y = data$vehicle_sales)) + geom_smooth(aes(date, data$vehicle_sales), method = "lm", color = "red") +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")

`geom_smooth()` using formula = 'y ~ x'

---Summary Stats---

The time series of vehicle sales shows distinct seasonality, with consistent peaks and troughs, and considerable volatility: the standard deviation of about 222 is roughly 18% of the mean (1261.7). The data appears multimodal, indicating several sales patterns or regimes. The mean (1261.7) and median (1268.5) are close, which is consistent with a roughly symmetric distribution around a central level of sales. These characteristics suggest that while seasonal patterns may be predictable, economic events and policy changes add complexity to forecasting.
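The volatility and symmetry statements can be put on a common scale with two simple ratios. The sketch below uses only the figures printed by `summary()` and `sd()` earlier in this report; the variable names are illustrative assumptions.

```r
# Values copied from the summary() and sd() output reported above.
sales_mean   <- 1261.7
sales_median <- 1268.5
sales_sd     <- 221.9962

# Coefficient of variation: spread relative to the typical level.
cv <- sales_sd / sales_mean

# Mean-median gap in units of standard deviation; values near 0 suggest little skew.
gap_in_sd <- (sales_mean - sales_median) / sales_sd

print(round(c(cv = cv, gap_in_sd = gap_in_sd), 3))
```

A small gap between mean and median is only a rough indicator of symmetry and does not rule out multiple modes.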

------

boxplot(data$vehicle_sales)

---Boxplot---

The boxplot of vehicle sales shows an interquartile range of about 303 (from roughly 1117 to 1420), so the central 50% of monthly sales falls within that band. The whiskers extend on both sides to periods with notably lower and higher sales. Based on the quartiles in the summary output, the minimum (670.5) and maximum (1845.7) both lie within 1.5 times the interquartile range of the box, so these extremes would not be drawn as separate outlier points. The balanced box and whiskers suggest an approximately symmetric distribution around the median, without a distinct skew.
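The usual 1.5 x IQR whisker rule can be checked by hand from the summary statistics. This is a rough sketch under one assumption: it uses the quartiles printed by `summary()`, whereas `boxplot()` computes hinges that can differ slightly from those quartiles.

```r
# Quartiles and extremes copied from the summary() output reported above.
q1 <- 1117.2
q3 <- 1420.2
sales_min <- 670.5
sales_max <- 1845.7

# Interquartile range and the 1.5 * IQR fences used for boxplot whiskers.
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE means the extreme value lies inside the fence and is not flagged as an outlier.
print(c(min_inside = sales_min >= lower_fence,
        max_inside = sales_max <= upper_fence))
```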

------

hist(data$vehicle_sales, freq = FALSE)
lines(density(data$vehicle_sales), lwd = 3, col = "red")

---Histogram---

The histogram depicting vehicle sales reveals a multimodal distribution characterized by multiple peaks.

-------

---Section 2---

From the preliminary exploratory analysis, the dataset consists of two columns: the date, recorded monthly (in character format), and the corresponding vehicle sales for each month (numeric). Sales range from a minimum of about 670 to a maximum of about 1846, but in the majority of months they lie between 1100 and 1400. The sales records contain decimal values, suggesting a potential averaging of monthly data. The line graph shows significant fluctuations in sales, attributable to factors such as seasonality and economic conditions. There are four major dips in vehicle sales, each coinciding with a sharp economic downturn: the early 1980s recession, the early 1990s recession, the global financial crisis (2008), and the COVID-19 pandemic. While additional factors may contribute, these two appear to be the primary drivers. Fitting a linear trend line to the plot shows an overall positive trend, indicating a gradual increase in vehicle sales over the observed period.

------

##6 Month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")

# Calculate and add the 6-month moving average
lines(date, zoo::rollmean(data$vehicle_sales, 6, fill = NA), col = "red", lwd = 2)

# Add legend
legend("topright", legend = c("Vehicle Sales", "6-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)

##12 Month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")

# Calculate and add the 12-month moving average
lines(date, zoo::rollmean(data$vehicle_sales, 12, fill = NA), col = "red", lwd = 2)

# Add legend
legend("topright", legend = c("Vehicle Sales", "12-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)

---Moving Average---

Two moving averages have been plotted in the graph, one with a 6-month window and the other with a 12-month window. The 6-month moving average exhibits a higher sensitivity to fluctuations and noise in the time series, making it less suitable for effectively representing the underlying trend. In contrast, the 12-month moving average provides a smoother and more stable representation, making it the preferred choice for capturing the overall trend in the data.
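The difference between the two windows can be illustrated on a synthetic series where the true trend is known. The sketch below is an assumption-laden illustration, not the analysis above: it uses a made-up linear trend plus a 12-month sine pattern, and a centered 2x12 (and 2x6) moving average built with `stats::filter` rather than the plain `rollmean` call used in this report.

```r
# Synthetic monthly series (assumption: linear trend plus a 12-month sine seasonal pattern).
t_idx    <- 1:120
trend    <- 1000 + 0.5 * t_idx
seasonal <- 100 * sin(2 * pi * t_idx / 12)
y        <- trend + seasonal

# Centered 2x12 and 2x6 moving-average weights.
w12 <- c(0.5, rep(1, 11), 0.5) / 12
w6  <- c(0.5, rep(1, 5), 0.5) / 6

# The namespace is explicit because dplyr masks filter().
ma12 <- stats::filter(y, w12, sides = 2)
ma6  <- stats::filter(y, w6, sides = 2)

# Largest deviation of each smoother from the true trend (NA values at the ends dropped).
err12 <- max(abs(ma12 - trend), na.rm = TRUE)
err6  <- max(abs(ma6 - trend), na.rm = TRUE)
print(c(err12 = err12, err6 = err6))
```

On this synthetic series the 12-month window averages over one full seasonal cycle and recovers the trend almost exactly, while the 6-month window leaves a large part of the seasonal swing in the smoothed line.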

# Calculate the remainder series by subtracting the moving average from the original series
data$Remainder <- data$vehicle_sales - rollmean(data$vehicle_sales, 12, fill = NA)

# Plot the remainder series
# data$date is stored as character, so convert it to Date for a continuous time axis
ggplot(data, aes(x = as.Date(date), group = 1)) +
  geom_line(aes(y = Remainder), colour = "blue", na.rm = TRUE) +
  labs(title = "Remainder Series after Removing Moving Average", x = "Date", y = "Remainder") +
  theme_minimal()

---Seasonality---

Subtracting the moving average from the original time series unveils fluctuations that escape the smoothing effect of the moving average. These residual patterns could signify short-term irregularities, noise, or intricate seasonal variations that do not align with the chosen period of the moving average. The observed patterns in the remainder series hint at the presence of additional seasonality with a frequency different from the window size of the applied moving average. Irregular spikes or drops may indicate specific events or anomalies within the data.
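One way to see whether a remainder series still carries a seasonal pattern is to average it by calendar month. The following is a minimal sketch on synthetic data (an assumed linear trend plus a 12-month sine pattern), using a centered 2x12 moving average from base R instead of the `rollmean` call above; all names are illustrative.

```r
# Synthetic monthly series (assumption), stored as a ts so cycle() gives the calendar month.
t_idx <- 1:120
y <- ts(1000 + 0.5 * t_idx + 100 * sin(2 * pi * t_idx / 12),
        start = c(1976, 1), frequency = 12)

# Detrend with a centered 2x12 moving average (stats::filter, since dplyr masks filter()).
trend_hat <- stats::filter(y, c(0.5, rep(1, 11), 0.5) / 12, sides = 2)
remainder <- y - trend_hat

# Average the remainder by calendar month; a non-flat profile indicates seasonality
# that the moving average did not absorb.
monthly_profile <- tapply(as.numeric(remainder), as.integer(cycle(y)), mean, na.rm = TRUE)
print(round(monthly_profile, 1))
```

A flat profile near zero would point to noise or one-off events instead of a repeating monthly pattern.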

------

if ("vehicle_sales" %in% names(data) && "date" %in% names(data) && nrow(data) > 0) {
  data_ts <- ts(data$vehicle_sales, frequency = 12, start = c(year(min(data$date)), month(min(data$date))))
  
  #STL decomposition
  data_stl <- stl(data_ts, s.window = "periodic")

  # Plot the components
  autoplot(data_stl) + labs(title = "STL Decomposition of Vehicle Sales")
} else {
  stop("The 'vehicle_sales' and/or 'date' column is missing, or there is no data in the dataframe.")
}

---Section 3---

The trend component fluctuates, reflecting shifts in the underlying level of sales over the long term. The seasonal component shows a distinct and consistent repeating pattern, pointing to a strong seasonal influence on sales. The noticeable amplitude of the seasonal component implies that seasonality has a substantial impact on monthly sales. The remainder is relatively small in scale compared with the seasonal fluctuations, which suggests that the STL decomposition has captured most of the systematic behavior in the data. The prominent seasonality matches expectations for vehicle sales, where seasonal patterns are influenced by factors such as new model releases and end-of-year sales, while economic cycles show up mainly in the trend. The STL decomposition underscores that seasonality is a significant feature of this time series and must be considered in any forecasting model.
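The comparison between the remainder and the seasonal component can be quantified with a variance ratio. The sketch below runs `stl()` on a synthetic series (assumed trend, sine seasonality, and Gaussian noise) and computes a commonly used strength-of-seasonality measure; it is an illustration that was not part of the original analysis, and the variable names are hypothetical.

```r
set.seed(1)  # reproducible synthetic noise

# Synthetic monthly series (assumption): trend + 12-month seasonality + noise.
t_idx <- 1:240
y <- ts(1000 + 0.5 * t_idx + 100 * sin(2 * pi * t_idx / 12) + rnorm(240, sd = 20),
        start = c(1976, 1), frequency = 12)

# STL decomposition with a fixed (periodic) seasonal pattern, as in the report.
fit  <- stl(y, s.window = "periodic")
comp <- fit$time.series  # columns: seasonal, trend, remainder

# Strength of seasonality: 1 - Var(remainder) / Var(seasonal + remainder), floored at 0.
seasonal_strength <- max(0, 1 - var(comp[, "remainder"]) /
                              var(comp[, "seasonal"] + comp[, "remainder"]))
print(round(seasonal_strength, 3))
```

Values close to 1 indicate that the seasonal component dominates the remainder; applying the same calculation to `data_stl$time.series` would give the corresponding figure for the vehicle sales series.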

------

#Naive seasonal forecast
naive_seasonal_forecast <- snaive(data_ts, h = 6)

#Naive forecast with drift
naive_drift_forecast <- rwf(data_ts, h = 6, drift = TRUE)

#Original time series and the forecasts
autoplot(data_ts) +
  autolayer(naive_seasonal_forecast, series = "Naive Seasonal Forecast", PI = FALSE) +
  autolayer(naive_drift_forecast, series = "Naive Drift Forecast", PI = FALSE) +
  labs(title = "Naive Forecasts for Vehicle Sales", x = "Time", y = "Sales") +
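  guides(colour = guide_legend(title = "Legend")) +
  theme_minimal() +
  # legend placement belongs in theme(); it must come after theme_minimal() to take effect
  theme(legend.position = "top")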

---Section 4---

The choice between the naive seasonal forecast and the naive forecast with drift depends on how visible the trend is in the time series. If a clear and consistent trend is apparent, the drift method is justified. If the trend is ambiguous, or the focus is on capturing seasonality, the naive seasonal method is more suitable.

To assess a naive forecast, the forecasted values should be compared with the actual values (which are not available here for the forecast period), or the forecast should be checked against the known patterns in the historical data. Given the strong seasonality identified above, the naive seasonal forecast is expected to capture the behavior of the data well, especially the seasonal component. In addition, the historical data shows a relatively stable level with an inconsistent trend direction, so a naive seasonal forecast without drift is likely to yield the more accurate results.
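When future actuals are unavailable, one option is to hold out the last few observations and score both methods on them. The sketch below does this on synthetic data with hand-written base R versions of the two forecasts; these are simplified re-implementations for illustration, not the `snaive()` and `rwf()` calls from the forecast package used above, and all names are assumptions.

```r
set.seed(1)  # reproducible synthetic noise

# Synthetic monthly series (assumption): mild trend + strong 12-month seasonality + noise.
n <- 120
h <- 6
t_idx <- 1:n
y <- 1000 + 0.5 * t_idx + 100 * sin(2 * pi * t_idx / 12) + rnorm(n, sd = 10)

# Hold out the last h observations as a test set.
train   <- y[1:(n - h)]
test    <- y[(n - h + 1):n]
n_train <- length(train)

# Seasonal naive: repeat the value from the same month one year earlier (valid for h <= 12).
snaive_fc <- train[n_train - 12 + 1:h]

# Drift: extend the line through the first and last training observations.
slope    <- (train[n_train] - train[1]) / (n_train - 1)
drift_fc <- train[n_train] + slope * (1:h)

# Mean absolute error of each forecast on the held-out months.
mae <- function(actual, forecast) mean(abs(actual - forecast))
mae_snaive <- mae(test, snaive_fc)
mae_drift  <- mae(test, drift_fc)
print(c(seasonal_naive = mae_snaive, drift = mae_drift))
```

On this strongly seasonal synthetic series the seasonal naive forecast has the lower error; the same holdout comparison on the vehicle sales series would show which method suits the real data.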

------