The dataset is the total vehicle sales series for the USA (“TOTALNSA”), available from FRED, the economic data service of the Federal Reserve Bank of St. Louis. The series is monthly, not seasonally adjusted, and reported in thousands of units. It serves as an important economic indicator, reflecting both consumer spending and conditions in the automotive industry. With 575 monthly observations (nearly 48 years), it gives a long view of vehicle sales dynamics.
Vehicle sales are closely tied to macroeconomic conditions, so fluctuations in income levels, economic policy, and similar factors can cause significant variability in the data. Government policies, environmental regulations, shifts in consumer preferences, and new technologies such as electric vehicles can all have a direct impact on the industry. The data also exhibits seasonality; for example, year-end months traditionally show higher sales figures.
The impact of all these factors and their interplay makes forecasting this time series challenging.
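# Packages assumed from the functions called in this analysis
library(ggplot2); library(dplyr); library(zoo)
library(lubridate); library(forecast)
# Line chart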
ggplot(vehicle_sales, aes(x = date, y = vehicle_sales, group = 1)) +
geom_line() +
labs(title = "Time Series of Vehicle Sales", x = "Date", y = "Total Sales")
Sales appear relatively stable over time, with no major upward or
downward trend. A clear pattern of peaks and troughs suggests
seasonality in the data. A few outliers (spikes and dips) point to
extraordinary events that significantly affected vehicle sales, such as
economic shocks, policy changes, or supply chain disruptions.
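One way to check the visual impression of seasonality is to average the series by calendar month. The sketch below uses a simulated stand-in series (`sim_sales`, a hypothetical name with made-up parameters), not the FRED data, so only the technique carries over.

```r
# Simulated stand-in for the monthly series (NOT the FRED data):
# a flat level, a 12-month sine-shaped seasonal pattern, and noise.
set.seed(42)
n_months <- 240
seasonal_shape <- 150 * sin(2 * pi * (1:12) / 12)
sim_sales <- ts(
  1250 + rep(seasonal_shape, length.out = n_months) + rnorm(n_months, sd = 60),
  start = c(2000, 1), frequency = 12
)

# Average sales for each calendar month across all years.
monthly_means <- tapply(sim_sales, cycle(sim_sales), mean)
names(monthly_means) <- month.abb
print(round(monthly_means))

# Gap between the strongest and weakest month, on average.
seasonal_spread <- max(monthly_means) - min(monthly_means)
print(seasonal_spread)
```

If the monthly averages differ by much more than the month-to-month noise, the peaks and troughs in the line chart are likely seasonal and not random.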
# Density Plot
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
geom_density(fill = "blue", alpha = 0.5) +
labs(title = "Density Plot of Vehicle Sales", x = "Total Sales", y = "Density") +
theme_minimal()
The density plot for vehicle sales shows a unimodal distribution with a
peak around the mid-range of total sales. This indicates that the sales data
tends to cluster around a central value, with fewer occurrences of
extremely low or high sales. The symmetry of the plot suggests that the
distribution of sales is relatively balanced, without a significant skew
in either direction.
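The visual judgement of symmetry can be backed by a number. The sketch below computes the moment-based sample skewness with a small helper (`sample_skewness`, a hypothetical name) on simulated data, not the FRED series. Values near zero are consistent with a symmetric distribution.

```r
# Simulated stand-in values (NOT the FRED data).
set.seed(1)
sim_sales <- rnorm(575, mean = 1260, sd = 220)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew <- sample_skewness(sim_sales)
print(skew)
```

A clearly positive value would indicate a longer right tail, and a clearly negative value a longer left tail.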
# Histogram
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
geom_histogram(binwidth = 50, fill = "blue", color = "black") + # Adjust binwidth as needed
labs(title = "Histogram of Vehicle Sales", x = "Total Sales", y = "Count") +
theme_minimal()
The histogram of vehicle sales, drawn with a bin width of 50, shows
several local peaks, a multimodal shape that the smoother density plot
does not reveal. This suggests the presence of multiple sales behaviors
within the time series. It also points to potential underlying factors
influencing sales volumes at different times.
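How many peaks a histogram shows depends on the bin width, so it is worth checking whether the multimodal impression survives other choices. The sketch below counts local peaks at several bin widths on a simulated two-component mixture. Both the data and the helper `count_peaks` are illustrative assumptions, not part of the original analysis.

```r
# Simulated two-regime mixture (NOT the FRED data).
set.seed(7)
sim_sales <- c(rnorm(300, mean = 1150, sd = 90), rnorm(275, mean = 1400, sd = 90))

# Count local peaks in a histogram with the given bin width.
count_peaks <- function(x, binwidth) {
  breaks <- seq(floor(min(x) / binwidth) * binwidth,
                ceiling(max(x) / binwidth) * binwidth,
                by = binwidth)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  padded <- c(0, counts, 0)
  n <- length(padded)
  left <- padded[1:(n - 2)]
  mid <- padded[2:(n - 1)]
  right <- padded[3:n]
  # A bin is a peak if it is higher than its left neighbour and at
  # least as high as its right neighbour (a flat top counts once).
  sum(mid > left & mid >= right)
}

bin_widths <- c(25, 50, 100, 200)
peaks_by_width <- sapply(bin_widths, function(w) count_peaks(sim_sales, w))
names(peaks_by_width) <- paste0("binwidth_", bin_widths)
print(peaks_by_width)
```

Peaks that appear only at narrow bin widths are more likely sampling noise than distinct sales regimes.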
# Boxplot
ggplot(vehicle_sales, aes(y = vehicle_sales)) +
geom_boxplot(fill = "blue", color = "black") +
labs(title = "Boxplot of Vehicle Sales", x = "", y = "Total Sales") +
theme_minimal()
The boxplot for vehicle sales shows a relatively concentrated
interquartile range, indicating that the middle 50% of sales data is
clustered within a narrow range of values. The long whiskers on both
ends show that some periods had significantly lower or higher sales
than the majority of the data; only points plotted beyond the whiskers
count as outliers. The symmetry of the box and whiskers indicates an
approximately even distribution of data around the median, without a
pronounced skew.
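The points a boxplot draws as outliers follow a simple rule: by default, `geom_boxplot()` flags values more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule by hand to simulated data with a few planted extreme values. The data are illustrative, not the FRED series.

```r
# Simulated stand-in values with five planted extremes (NOT the FRED data).
set.seed(3)
planted <- c(400, 450, 2200, 2300, 2400)
sim_sales <- c(rnorm(570, mean = 1260, sd = 200), planted)

# Quartiles and interquartile range.
quartiles <- unname(quantile(sim_sales, c(0.25, 0.75)))
iqr_value <- quartiles[2] - quartiles[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- quartiles[1] - 1.5 * iqr_value
upper_fence <- quartiles[2] + 1.5 * iqr_value

# Observations outside the fences are the ones a boxplot marks as outliers.
flagged <- sim_sales[sim_sales < lower_fence | sim_sales > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(sort(flagged))
```

Listing the flagged observations alongside their dates makes it easier to link them to specific events.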
# Summary Statistics
summary_stats <- vehicle_sales %>%
summarise(
Count = n(),
Mean = mean(vehicle_sales, na.rm = TRUE),
Median = median(vehicle_sales, na.rm = TRUE),
Mode = as.numeric(names(sort(table(vehicle_sales), decreasing = TRUE)[1])), # Mode is rarely informative for continuous data
SD = sd(vehicle_sales, na.rm = TRUE),
Range = max(vehicle_sales, na.rm = TRUE) - min(vehicle_sales, na.rm = TRUE),
IQR = IQR(vehicle_sales, na.rm = TRUE)
)
summary_stats
## Count Mean Median Mode SD Range IQR
## 1 575 1261.744 1268.455 839.3 221.9962 1175.247 302.971
The vehicle sales time series shows seasonality, indicated by regular peaks and troughs, and volatility: the standard deviation of about 222 is roughly 18% of the mean of about 1,262, and the boxplot identifies outliers. The histogram is multimodal, suggesting different sales patterns or regimes, while the close agreement between the mean (1,261.7) and the median (1,268.5) points to a roughly symmetric distribution around a central level of sales. These characteristics imply that while the seasonal pattern may be predictable, economic events and policy changes add complexity to forecasting.
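The reported summary statistics can be turned into a few unitless ratios that make the claims about volatility and symmetry concrete. The sketch below uses only the values printed in the summary table. The variable names are illustrative.

```r
# Values copied from the summary statistics table above.
mean_sales   <- 1261.744
median_sales <- 1268.455
sd_sales     <- 221.9962
range_sales  <- 1175.247
iqr_sales    <- 302.971

# Coefficient of variation: spread relative to the typical level.
cv <- sd_sales / mean_sales                                  # about 0.176

# Share of the full range covered by the middle 50% of observations.
iqr_share <- iqr_sales / range_sales                         # about 0.258

# Pearson's second skewness coefficient, based on mean and median.
pearson_skew <- 3 * (mean_sales - median_sales) / sd_sales   # about -0.091

print(round(c(cv = cv, iqr_share = iqr_share, pearson_skew = pearson_skew), 3))
```

A skewness coefficient this close to zero supports the reading of a roughly symmetric distribution, and the middle half of the observations occupies only about a quarter of the full range.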
# Calculate the moving average
window_size <- 12 # for a 12-month moving average
vehicle_sales <- vehicle_sales %>%
arrange(date) %>% # Make sure the data is sorted by date
mutate(MovingAverage = rollmean(vehicle_sales, k = window_size, fill = NA, align = 'center'))
# Plot the time series with the moving average
ggplot(vehicle_sales, aes(x = date, group = 1)) +
geom_line(aes(y = vehicle_sales), colour = "blue") +
geom_line(aes(y = MovingAverage), colour = "red", na.rm = TRUE) +
labs(title = "Vehicle Sales with 12-Month Moving Average", x = "Date", y = "Total Sales") +
theme_minimal()
# Calculate the remainder series by subtracting the moving average from the original series
vehicle_sales$Remainder <- vehicle_sales$vehicle_sales - vehicle_sales$MovingAverage
# Plot the remainder series
ggplot(vehicle_sales, aes(x = date, group = 1)) +
geom_line(aes(y = Remainder), colour = "green", na.rm = TRUE) +
labs(title = "Remainder Series after Removing Moving Average", x = "Date", y = "Remainder") +
theme_minimal()
Subtracting the moving average from the original time series removes
the slowly varying level and leaves the fluctuations that the moving
average does not capture. Because a 12-month average smooths over a
full year, this remainder still contains the within-year seasonal
pattern, together with short-term irregularities and noise.
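Because 12 is even, a 12-term window cannot be centred exactly on a single month. One common alternative is the 2x12 moving average, which gives half weight to the two end months of a 13-month window. It is sketched here with base R's `stats::filter()` on a simulated series and is not the method used in the analysis above.

```r
# Simulated stand-in series (NOT the FRED data): linear trend,
# 12-month seasonal pattern, and noise.
set.seed(11)
n_months <- 120
sim_sales <- ts(
  1200 + 0.5 * (1:n_months) +
    rep(100 * cos(2 * pi * (1:12) / 12), length.out = n_months) +
    rnorm(n_months, sd = 30),
  start = c(2010, 1), frequency = 12
)

# 2x12 weights: 13 terms, half weight on the two ends, summing to 1.
weights_2x12 <- c(0.5, rep(1, 11), 0.5) / 12

# Symmetric (centred) filter; the first and last 6 values are NA.
trend_2x12 <- stats::filter(sim_sales, filter = weights_2x12, sides = 2)

# Detrended series: seasonal pattern plus irregular component.
detrended <- sim_sales - trend_2x12
print(head(na.omit(trend_2x12)))
```

With these weights every calendar month contributes equally to each average, so a stable 12-month seasonal pattern cancels out of the trend estimate.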
# Check if the vehicle_sales dataframe has the correct columns and data
if ("vehicle_sales" %in% names(vehicle_sales) && "date" %in% names(vehicle_sales) && nrow(vehicle_sales) > 0) {
  # Assuming vehicle_sales$date is in the proper Date format
  vehicle_sales_ts <- ts(vehicle_sales$vehicle_sales, frequency = 12,
                         start = c(year(min(vehicle_sales$date)), month(min(vehicle_sales$date))))
  # Perform STL decomposition
  vehicle_sales_stl <- stl(vehicle_sales_ts, s.window = "periodic")
  # Plot the components
  autoplot(vehicle_sales_stl) + labs(title = "STL Decomposition of Vehicle Sales")
} else {
  stop("The 'vehicle_sales' and/or 'date' column is missing, or there is no data in the data frame.")
}
The fluctuations in the trend line suggest changes in the underlying sales levels over the long term. The seasonal component shows a consistent repeating pattern each year, which suggests a strong seasonal influence. The amplitude of the seasonal component is quite noticeable, implying that the seasonality has a significant impact on sales.
The remainder, or residuals, are relatively small compared to the seasonal fluctuations, indicating that the STL decomposition has captured most of the systematic behavior in the data.
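The statement that the remainder is small relative to the seasonal component can be quantified. One common measure, described in Hyndman and Athanasopoulos's "Forecasting: Principles and Practice", is seasonal strength: 1 minus the ratio of the remainder variance to the variance of seasonal plus remainder, floored at zero. The sketch below applies it to an STL fit on a simulated series, not the FRED data.

```r
# Simulated stand-in series (NOT the FRED data).
set.seed(5)
n_months <- 240
sim_sales <- ts(
  1250 + rep(150 * sin(2 * pi * (1:12) / 12), length.out = n_months) +
    rnorm(n_months, sd = 40),
  start = c(2000, 1), frequency = 12
)

# STL decomposition with a fixed (periodic) seasonal pattern.
fit <- stl(sim_sales, s.window = "periodic")
seasonal  <- as.numeric(fit$time.series[, "seasonal"])
trend     <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# Strength measures lie between 0 (none) and 1 (dominant).
seasonal_strength <- max(0, 1 - var(remainder) / var(seasonal + remainder))
trend_strength    <- max(0, 1 - var(remainder) / var(trend + remainder))
print(round(c(seasonal = seasonal_strength, trend = trend_strength), 3))
```

A seasonal strength close to 1 would support the reading that seasonality dominates the irregular component.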
# Create a naive seasonal forecast
naive_seasonal_forecast <- snaive(vehicle_sales_ts, h = 6)
# Create a naive forecast with drift
naive_drift_forecast <- rwf(vehicle_sales_ts, h = 6, drift = TRUE)
# Plot the original time series and the forecasts
autoplot(vehicle_sales_ts) +
autolayer(naive_seasonal_forecast, series = "Naive Seasonal Forecast", PI = FALSE) +
autolayer(naive_drift_forecast, series = "Naive Drift Forecast", PI = FALSE) +
labs(title = "Naive Forecasts for Vehicle Sales", x = "Time", y = "Sales") +
guides(colour=guide_legend(title="Legend")) +
theme_minimal()
As the analysis suggests strong seasonality, the seasonal naive
forecast is expected to perform well in representing the behavior of the
data.
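This expectation can be tested by holding out the most recent observations and comparing forecast errors. The sketch below writes both benchmark methods by hand in base R and scores them on a 12-month holdout of a simulated series. The holdout length and the data are assumptions for illustration, and the result says nothing about the real series.

```r
# Simulated stand-in series (NOT the FRED data).
set.seed(9)
n_months <- 132
sim_sales <- 1250 + rep(150 * sin(2 * pi * (1:12) / 12), length.out = n_months) +
  rnorm(n_months, sd = 50)

# Hold out the last 12 months for evaluation.
h <- 12
train   <- sim_sales[1:(n_months - h)]
holdout <- sim_sales[(n_months - h + 1):n_months]
n_train <- length(train)

# Seasonal naive: each forecast equals the value from the same month
# in the last observed year.
last_year <- train[(n_train - 11):n_train]
snaive_fc <- last_year[((seq_len(h) - 1) %% 12) + 1]

# Naive with drift: last value plus the average historical change per step.
drift_slope <- (train[n_train] - train[1]) / (n_train - 1)
drift_fc <- train[n_train] + seq_len(h) * drift_slope

# Root mean squared error on the holdout.
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))
errors <- c(seasonal_naive = rmse(holdout, snaive_fc), drift = rmse(holdout, drift_fc))
print(round(errors, 1))
```

Applying the same split to `vehicle_sales_ts` would show whether the seasonal naive forecast has the lower error on the actual data.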