Code
suppressWarnings({
suppressMessages({
library(dplyr)
library(ggplot2)
library(slider)
library(tidyr)
library(readxl)
library(zoo)
library(forecast)
library(lubridate)
})
})suppressWarnings({
suppressMessages({
library(dplyr)
library(ggplot2)
library(slider)
library(tidyr)
library(readxl)
library(zoo)
library(forecast)
library(lubridate)
})
})data <- read.csv("/Users/lasyapinnamaneni/Desktop/MS BANA/Time series/total_vehicle_sales.csv", header = TRUE)
# Convert date to appropriate format if necessary
data$date <- as.Date(data$date) # Assuming date column is already in date format
# Plot the data
ggplot(data, aes(x = date, y = vehicle_sales)) +
geom_line(color = "#0099f9", size = 1.4) +
theme_minimal() + # Changed to theme_minimal for better visibility
theme(
axis.text = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 15),
plot.title = element_text(size = 18, face = "bold")
) +
labs(
title = "Average Vehicle Sales Dataset",
x = "Date",
y = "Average Vehicle Sales"
)Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
The dataset represents the total monthly vehicle sales in the USA (“TOTALNSA” series) published by the Federal Reserve Bank of St. Louis (FRED). This time series data is subject to monthly updates and serves as a pivotal economic indicator, capturing variations in consumer spending patterns and trends within the automotive industry. Spanning multiple years, the dataset provides a comprehensive perspective on the dynamics of vehicle sales, making it a valuable resource for analyzing long-term trends and patterns in the market.
Macroeconomic trends, government policies, environmental regulations, shifts in consumer preferences, technological advancements, and seasonal patterns all contribute to variations in vehicle sales. Forecasting Total Vehicle Sales is challenging due to the dynamic nature of these factors, introducing high levels of uncertainty. The industry’s susceptibility to unforeseen events, rapid technological changes, and the intricate web of influencing variables make accurate forecasting a complex task that requires a sophisticated approach and continuous monitoring of multiple factors.
date <- as.Date(data$date, format = "%Y-%m-%d")
str(date) Date[1:575], format: "1976-01-01" "1976-02-01" "1976-03-01" "1976-04-01" "1976-05-01" ...
ggplot() +
geom_line(aes(x = date, y = data$vehicle_sales)) + geom_smooth(aes(date, data$vehicle_sales), method = "lm", color = "red") +
labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")`geom_smooth()` using formula = 'y ~ x'
boxplot(data$vehicle_sales)Boxplot—In the boxplot, there are no outliers detected, and a symmetrical distribution is evident.
hist(data$vehicle_sales, freq = FALSE)
lines(density(data$vehicle_sales), lwd = 3, col = "red")The histogram depicting vehicle sales reveals a bimodal distribution characterized by dual peaks.
ggplot(data, aes(x = vehicle_sales)) +
geom_density(fill = "blue", alpha = 0.5) +
labs(title = "Density Plot of Vehicle Sales", x = "Total Sales", y = "Density") +
theme_minimal()The density curve representing vehicle sales illustrates a bimodal pattern characterized by two distinct peaks. Furthermore, the distribution appears symmetric, suggesting that the mean is equal to the median.
# Calculate summary statistics
summary_stats <- summary(data$vehicle_sales)
# Create a table
summary_table <- data.frame(
Metric = c("Number of Observations", "Mean", "Median", "Mode", "Standard Deviation", "Range"),
Value = c(
length(data$vehicle_sales),
mean(data$vehicle_sales),
median(data$vehicle_sales),
table(data$vehicle_sales)[which.max(table(data$vehicle_sales))],
sd(data$vehicle_sales),
diff(range(data$vehicle_sales))
)
)
# Print the table
print(summary_table) Metric Value
1 Number of Observations 575.0000
2 Mean 1261.7437
3 Median 1268.4550
4 Mode 2.0000
5 Standard Deviation 221.9962
6 Range 1175.2470
##6 Month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")
# Calculate and add the 6-month moving average
lines(date, zoo::rollmean(data$vehicle_sales, 6, fill = NA), col = "red", lwd = 2)
# Add legend
legend("topright", legend = c("Vehicle Sales", "6-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)##12 Month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")
# Calculate and add the 12-month moving average
lines(date, zoo:: rollmean(data$vehicle_sales, 12, fill = NA), col = "red", lwd = 2)
# Add legend
legend("topright", legend = c("Vehicle Sales", "12-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)The graph displays two moving averages: one with a 6-month window and the other with a 12-month window. The 6-month moving average is more sensitive to fluctuations and noise in the time series, making it less effective in representing the underlying trend. On the other hand, the 12-month moving average offers a smoother and more stable representation, making it the preferred choice for capturing the overall trend in the data.
# Calculate the remainder series by subtracting the moving average from the original series
data$Remainder <- data$vehicle_sales - rollmean(data$vehicle_sales, 12, fill = NA)
# Plot the remainder series
ggplot(data, aes(x = date, group = 1)) +
geom_line(aes(y = Remainder), colour = "blue", na.rm = TRUE) +
labs(title = "Remainder Series after Removing Moving Average", x = "Date", y = "Remainder") +
theme_minimal()
The remainder plot illustrates the seasonality component not explained by the moving average. To capture this seasonality, performing time series decomposition is essential.
if ("vehicle_sales" %in% names(data) && "date" %in% names(data) && nrow(data) > 0) {
data_ts <- ts(data$vehicle_sales, frequency = 12, start = c(year(min(data$date)), month(min(data$date))))
#STL decomposition
data_stl <- stl(data_ts, s.window = "periodic")
# Plot the components
autoplot(data_stl) + labs(title = "STL Decomposition of Vehicle Sales")
} else {
stop("The 'Sales' and/or 'Date' column is missing, or there is no data in the dataframe.")
}The variations in the trend line indicate shifts in the underlying sales levels over an extended period. A consistent repeating pattern in the seasonal component suggests a robust seasonal influence, with a noticeable amplitude signifying a significant impact on sales. The residuals, representing the remaining unexplained variation, are relatively small compared to the seasonal fluctuations. This indicates that the STL decomposition effectively captures the majority of the systematic behavior in the data.
#Naive seasonal forecast
naive_seasonal_forecast <- snaive(data_ts, h = 6)
#Naive forecast with drift
naive_drift_forecast <- rwf(data_ts, h = 6, drift = TRUE)
#Original time series and the forecasts
autoplot(data_ts) +
autolayer(naive_seasonal_forecast, series = "Naive Seasonal Forecast", PI = FALSE) +
autolayer(naive_drift_forecast, series = "Naive Drift Forecast", PI = FALSE) +
labs(title = "Naive Forecasts for Vehicle Sales", x = "Time", y = "Sales") +
guides(colour=guide_legend(title="Legend"), position = 'topright') +
theme_minimal()
The Naive forecast with drift accurately captures the behavior of the data, especially for Total Vehicle Sales, revealing a distinct linear trend along with seasonal patterns. When dealing with seasonal data like Vehicle Sales, including a drift component proves advantageous, making Naive with Drift a better option than the basic Naive forecast. This method enables the model to account for both the linear trend and seasonality, improving its overall forecasting performance.