Time Series Data Description, Exploratory Analysis, and Decomposition

Author

Lasya Pinnamaneni

Section 1

Reading Libraries

Code
suppressWarnings({
  suppressMessages({
library(dplyr)
library(ggplot2)
library(slider)
library(tidyr)
library(readxl)
library(zoo)
library(forecast)
library(lubridate)
  })
})

Read Total Vehicle Sales data

Code
data <- read.csv("/Users/lasyapinnamaneni/Desktop/MS BANA/Time series/total_vehicle_sales.csv", header = TRUE)

# Convert date to appropriate format if necessary
data$date <- as.Date(data$date) # Assuming date column is already in date format

# Plot the data
ggplot(data, aes(x = date, y = vehicle_sales)) +
  geom_line(color = "#0099f9", size = 1.4) + 
  theme_minimal() +  # Changed to theme_minimal for better visibility
  theme(
    axis.text = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18, face = "bold")
  ) + 
  labs(
    title = "Average Vehicle Sales Dataset",
    x = "Date",
    y = "Average Vehicle Sales"
  )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

The dataset represents the total monthly vehicle sales in the USA (“TOTALNSA” series) published by the Federal Reserve Bank of St. Louis (FRED). This time series data is subject to monthly updates and serves as a pivotal economic indicator, capturing variations in consumer spending patterns and trends within the automotive industry. Spanning multiple years, the dataset provides a comprehensive perspective on the dynamics of vehicle sales, making it a valuable resource for analyzing long-term trends and patterns in the market.

Macroeconomic trends, government policies, environmental regulations, shifts in consumer preferences, technological advancements, and seasonal patterns all contribute to variations in vehicle sales. Forecasting Total Vehicle Sales is challenging due to the dynamic nature of these factors, introducing high levels of uncertainty. The industry’s susceptibility to unforeseen events, rapid technological changes, and the intricate web of influencing variables make accurate forecasting a complex task that requires a sophisticated approach and continuous monitoring of multiple factors.

Section 2

Code
date <- as.Date(data$date, format = "%Y-%m-%d")
str(date)
 Date[1:575], format: "1976-01-01" "1976-02-01" "1976-03-01" "1976-04-01" "1976-05-01" ...

Monthly Vehicles sales with Trendline

Code
ggplot() +
  geom_line(aes(x = date, y = data$vehicle_sales)) + geom_smooth(aes(date, data$vehicle_sales), method = "lm", color = "red") +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")
`geom_smooth()` using formula = 'y ~ x'

BoxPlot

Code
boxplot(data$vehicle_sales)

Boxplot—In the boxplot, there are no outliers detected, and a symmetrical distribution is evident.

Histogram

Code
hist(data$vehicle_sales, freq = FALSE)
lines(density(data$vehicle_sales), lwd = 3, col = "red")

The histogram depicting vehicle sales reveals a bimodal distribution characterized by dual peaks.

Density Curve

Code
ggplot(data, aes(x = vehicle_sales)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Density Plot of Vehicle Sales", x = "Total Sales", y = "Density") +
  theme_minimal()

The density curve representing vehicle sales illustrates a bimodal pattern characterized by two distinct peaks. Furthermore, the distribution appears symmetric, suggesting that the mean is equal to the median.

Summary Statistics

Code
# Calculate summary statistics
summary_stats <- summary(data$vehicle_sales)

# Create a table
summary_table <- data.frame(
  Metric = c("Number of Observations", "Mean", "Median", "Mode", "Standard Deviation", "Range"),
  Value = c(
    length(data$vehicle_sales),
    mean(data$vehicle_sales),
    median(data$vehicle_sales),
    table(data$vehicle_sales)[which.max(table(data$vehicle_sales))],
    sd(data$vehicle_sales),
    diff(range(data$vehicle_sales))
  )
)

# Print the table
print(summary_table)
                  Metric     Value
1 Number of Observations  575.0000
2                   Mean 1261.7437
3                 Median 1268.4550
4                   Mode    2.0000
5     Standard Deviation  221.9962
6                  Range 1175.2470

Section 3

6 Month average

Code
##6 Month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")

# Calculate and add the 6-month moving average
lines(date, zoo::rollmean(data$vehicle_sales, 6, fill = NA), col = "red", lwd = 2)

# Add legend
legend("topright", legend = c("Vehicle Sales", "6-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)

12 Month average

Code
##12 Month average
plot(date, data$vehicle_sales, type = "l", col = "blue", xlab = "Date", ylab = "Vehicle Sales", main = "Vehicle Sales Over Time")

# Calculate and add the 12-month moving average
lines(date, zoo:: rollmean(data$vehicle_sales, 12, fill = NA), col = "red", lwd = 2)

# Add legend
legend("topright", legend = c("Vehicle Sales", "12-Month Moving Average"), col = c("blue", "red"), lwd = c(1, 2), cex = 0.4)

Moving Average & Remainder

The graph displays two moving averages: one with a 6-month window and the other with a 12-month window. The 6-month moving average is more sensitive to fluctuations and noise in the time series, making it less effective in representing the underlying trend. On the other hand, the 12-month moving average offers a smoother and more stable representation, making it the preferred choice for capturing the overall trend in the data.

Code
# Calculate the remainder series by subtracting the moving average from the original series
data$Remainder <- data$vehicle_sales - rollmean(data$vehicle_sales, 12, fill = NA)

# Plot the remainder series
ggplot(data, aes(x = date, group = 1)) +
  geom_line(aes(y = Remainder), colour = "blue", na.rm = TRUE) +
  labs(title = "Remainder Series after Removing Moving Average", x = "Date", y = "Remainder") +
  theme_minimal()

Seasonality


The remainder plot illustrates the seasonality component not explained by the moving average. To capture this seasonality, performing time series decomposition is essential.

Code
if ("vehicle_sales" %in% names(data) && "date" %in% names(data) && nrow(data) > 0) {
  data_ts <- ts(data$vehicle_sales, frequency = 12, start = c(year(min(data$date)), month(min(data$date))))
  
  #STL decomposition
  data_stl <- stl(data_ts, s.window = "periodic")

  # Plot the components
  autoplot(data_stl) + labs(title = "STL Decomposition of Vehicle Sales")
} else {
  stop("The 'Sales' and/or 'Date' column is missing, or there is no data in the dataframe.")
}

The variations in the trend line indicate shifts in the underlying sales levels over an extended period. A consistent repeating pattern in the seasonal component suggests a robust seasonal influence, with a noticeable amplitude signifying a significant impact on sales. The residuals, representing the remaining unexplained variation, are relatively small compared to the seasonal fluctuations. This indicates that the STL decomposition effectively captures the majority of the systematic behavior in the data.

Section 4

Code
#Naive seasonal forecast
naive_seasonal_forecast <- snaive(data_ts, h = 6)

#Naive forecast with drift
naive_drift_forecast <- rwf(data_ts, h = 6, drift = TRUE)

#Original time series and the forecasts
autoplot(data_ts) +
  autolayer(naive_seasonal_forecast, series = "Naive Seasonal Forecast", PI = FALSE) +
  autolayer(naive_drift_forecast, series = "Naive Drift Forecast", PI = FALSE) +
  labs(title = "Naive Forecasts for Vehicle Sales", x = "Time", y = "Sales") +
  guides(colour=guide_legend(title="Legend"), position = 'topright') +
  theme_minimal()


The Naive forecast with drift accurately captures the behavior of the data, especially for Total Vehicle Sales, revealing a distinct linear trend along with seasonal patterns. When dealing with seasonal data like Vehicle Sales, including a drift component proves advantageous, making Naive with Drift a better option than the basic Naive forecast. This method enables the model to account for both the linear trend and seasonality, improving its overall forecasting performance.