Total Vehicle Sales

Author

Moyuri Sarkar

DESCRIPTION

The Total Vehicle Sales data set was retrieved from Federal Reserve Economic Data (FRED), an online database of hundreds of thousands of economic time series from scores of national, international, public, and private sources. The data were sourced from the U.S. Bureau of Economic Analysis and record the number of vehicles sold each month, in thousands of units. The full series comprises 587 monthly observations, from January 1976 to November 2024, and is not seasonally adjusted. Our analysis uses the 120 observations from January 2010 to December 2019.
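
As a quick consistency check on the observation counts, the sketch below builds a monthly date index for the full span and filters it to the analysis window. The data frame `full_series` and its `sales` column are hypothetical placeholders filled with `NA`; only the date arithmetic is meaningful here, and this is not necessarily how the filtered CSV was produced.

```r
# Monthly index covering the full span (January 1976 to November 2024).
# 'full_series' and its 'sales' column are placeholders, not the actual FRED data.
full_dates <- seq(as.Date("1976-01-01"), as.Date("2024-11-01"), by = "month")
full_series <- data.frame(date = full_dates, sales = NA_real_)

# Keep only the analysis window: January 2010 through December 2019
in_window <- full_series$date >= as.Date("2010-01-01") &
  full_series$date <= as.Date("2019-12-01")
analysis_window <- full_series[in_window, ]

n_full <- nrow(full_series)        # number of months in the full span
n_window <- nrow(analysis_window)  # number of months in the analysis window
print(c(full = n_full, window = n_window))
```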

The variation in vehicle sales over time is likely driven by factors such as:

Economic cycles: periods of economic growth or recession significantly affect consumer purchasing power.
Seasonality: sales may peak in certain months because of holidays, end-of-year promotions, or new model releases.
Policy changes: tax incentives or emissions regulations could affect vehicle demand.
External shocks: events such as oil price fluctuations, pandemics, or supply chain disruptions can cause sudden changes.

Forecasting this data set should be manageable given its regular monthly structure and long history. However, incorporating external data such as economic indicators or policy changes could significantly improve accuracy. Without such context, a model may struggle with unexpected variability, especially during periods of significant market disruption.

Descriptive Analysis: Data Summary and Distribution

# Load necessary libraries
library(ggplot2)
library(dplyr)


library(forecast)

library(tseries)

library(tidyr)

# Load the dataset
vehicle_sales <- read.csv("Filtered_Vehicle_Sales_Data.csv")
vehicle_sales$date <- as.Date(vehicle_sales$date)

# Section 2: Exploratory Data Analysis
# Line chart of the time series
ggplot(vehicle_sales, aes(x = date, y = vehicle_sales)) +
  geom_line() +
  labs(title = "Total Vehicle Sales Over Time", x = "Date", y = "Vehicle Sales") +
  theme_minimal()

Vehicle sales increase steadily over the years.
Monthly vehicle sales exhibit regular peaks and troughs, indicating a seasonal pattern. This could be due to factors such as end-of-year promotions, holidays, or cyclical demand.
While the general trend is upward, there are significant fluctuations in sales within each year.
Toward the end of the period (around 2017–2018), the upward momentum seems to moderate, with smaller increases in sales compared to earlier years.
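
One way to check whether growth is moderating is to compare year-over-year changes in annual totals. The sketch below uses a small synthetic series (not the real data); the names `toy`, `annual_totals`, and `yoy_growth` are illustrative assumptions.

```r
# Synthetic monthly series (NOT the real data): three years with a rising level
dates <- seq(as.Date("2010-01-01"), by = "month", length.out = 36)
toy <- data.frame(date = dates, sales = rep(c(100, 110, 115.5), each = 12))

# Total sales per calendar year, as a named numeric vector
toy$year <- format(toy$date, "%Y")
annual_totals <- sapply(split(toy$sales, toy$year), sum)

# Year-over-year percentage change in annual totals
yoy_growth <- 100 * diff(annual_totals) / head(annual_totals, -1)
print(round(yoy_growth, 1))
```

Shrinking percentages from one year to the next would support the reading that the upward momentum is fading.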

# Density plot of vehicle sales
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Density Plot of Vehicle Sales", x = "Vehicle Sales", y = "Density") +
  theme_minimal()

# Histogram of vehicle sales
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
  geom_histogram(binwidth = 100, fill = "blue", alpha = 0.5) +
  labs(title = "Histogram of Vehicle Sales", x = "Vehicle Sales", y = "Frequency") +
  theme_minimal()

# Boxplot of vehicle sales by year
vehicle_sales$year <- format(vehicle_sales$date, "%Y")
ggplot(vehicle_sales, aes(x = year, y = vehicle_sales)) +
  geom_boxplot(fill = "blue", alpha = 0.5) +
  labs(title = "Boxplot of Vehicle Sales by Year", x = "Year", y = "Vehicle Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Monthly vehicle sales are concentrated between roughly 1,250 and 1,500 thousand units (1.25 to 1.5 million vehicles).
The distribution is roughly symmetric, with no apparent heavy tails or pronounced skewness.
The data do not show extreme outliers or anomalies.

# Summary statistics table
summary_stats <- vehicle_sales %>%
  summarise(
    Observations = n(),
    Mean = mean(vehicle_sales, na.rm = TRUE),
    Median = median(vehicle_sales, na.rm = TRUE),
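    # Note: with continuous data most values occur only once, so this returns one of
    # the tied values (likely the smallest) rather than a meaningful mode
    Mode = as.numeric(names(sort(table(vehicle_sales), decreasing = TRUE)[1])),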
    Standard_Deviation = sd(vehicle_sales, na.rm = TRUE),
    Range = max(vehicle_sales, na.rm = TRUE) - min(vehicle_sales, na.rm = TRUE)
  )
print(summary_stats)

  Observations     Mean   Median    Mode Standard_Deviation    Range
1          120 1340.372 1378.056 712.469            222.056 1007.158
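
Because the sales figures are continuous, the reported mode is not very informative. A binned mode is one possible alternative. The sketch below uses made-up values (not the real data) and a bin width of 100, chosen to mirror the histogram's `binwidth`, although the bin boundaries may differ from those ggplot2 draws.

```r
# Toy continuous values (NOT the real data); every value is unique,
# so a value-level mode is not informative
x <- c(712.5, 1180.2, 1260.4, 1310.9, 1355.1, 1390.7, 1402.3, 1455.8, 1510.6, 1719.6)

# Group values into bins of width 100 and count observations per bin
bin_width <- 100
bin_start <- floor(x / bin_width) * bin_width
bin_counts <- table(bin_start)

# The modal bin is the one holding the most observations
modal_bin_start <- as.numeric(names(bin_counts)[which.max(bin_counts)])
cat("Modal bin: [", modal_bin_start, ",", modal_bin_start + bin_width, ")\n")
```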

Moving Average

# Section 3: Moving Average Analysis and Time Series Decomposition
# Calculate and visualize moving average
vehicle_sales <- vehicle_sales %>%
  mutate(moving_avg = zoo::rollmean(vehicle_sales, k = 12, fill = NA))

ggplot(vehicle_sales, aes(x = date)) +
  geom_line(aes(y = vehicle_sales), color = "blue") +
  geom_line(aes(y = moving_avg), color = "red", linetype = "dashed") +
  labs(title = "Moving Average of Vehicle Sales", x = "Date", y = "Vehicle Sales") +
  theme_minimal()

Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_line()`).

The 12-month moving average (red line) shows steady growth in vehicle sales.
After 2016, the moving average appears to flatten, suggesting that vehicle sales may have reached a saturation point or were affected by external factors such as an economic slowdown, policy changes, or shifting consumer preferences.
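
A 12-term window has an even length, so it cannot be centered exactly on a month. A common remedy is the 2x12 moving average, which spans 13 months and gives half weight to the two end points. The sketch below applies it with `stats::filter()` to a synthetic series (not the real data); the variable names are illustrative.

```r
# Toy monthly series (NOT the real data): linear trend plus a repeating 12-month pattern
n <- 48
trend <- 1000 + 5 * seq_len(n)
seasonal <- rep(c(-60, -40, 20, 10, 30, 25, 5, 15, -20, -10, -25, 50), length.out = n)
y <- trend + seasonal

# 2x12 moving average: 13 weights, half weight on the two end points
w <- c(0.5, rep(1, 11), 0.5) / 12

# stats::filter is called with its namespace because dplyr masks filter()
centered_ma <- stats::filter(y, filter = w, sides = 2)

# Six values are lost at each end of the series
n_missing <- sum(is.na(centered_ma))
print(n_missing)
```

In this toy case the seasonal pattern sums to zero over a year, so the smoothed series recovers the linear trend exactly wherever it is defined.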

Remainder

# Calculate and visualize remainder series
vehicle_sales <- vehicle_sales %>%
  mutate(remainder = vehicle_sales - moving_avg)

ggplot(vehicle_sales, aes(x = date, y = remainder)) +
  geom_line(color = "green") +
  labs(title = "Remainder Series After Subtracting Moving Average", x = "Date", y = "Remainder") +
  theme_minimal()

Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_line()`).

The remainder series shows significant fluctuations, indicating that the moving average does not fully capture short-term variations in vehicle sales.
The recurring peaks and troughs suggest a seasonal pattern, meaning the moving average alone may not be sufficient to model the seasonal component.
Some extreme spikes indicate possible outliers or unexpected events affecting vehicle sales, such as economic downturns, policy changes, or supply chain disruptions.
The remainder does not appear completely random, implying that there may be additional underlying structure, such as cyclical trends or external factors influencing sales.
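
Whether a remainder series is random can be checked more formally with its autocorrelation function and a Ljung-Box test. The sketch below uses a synthetic remainder with a built-in 12-month cycle (not the real data). For the actual series, the `NA` values introduced by the moving average would need to be dropped first, for example with `na.omit()`.

```r
set.seed(42)  # reproducible toy example

# Toy remainder series (NOT the real data): a 12-month cycle plus small noise
n <- 120
toy_remainder <- 50 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 5)

# Autocorrelation at the seasonal lag; values near 1 indicate a strong 12-month pattern
acf_values <- acf(toy_remainder, lag.max = 24, plot = FALSE)
acf_lag12 <- acf_values$acf[13]  # element 1 is lag 0, so lag 12 is element 13

# Ljung-Box test: a small p-value rejects the hypothesis of no autocorrelation
lb_test <- Box.test(toy_remainder, lag = 24, type = "Ljung-Box")
print(acf_lag12)
print(lb_test$p.value)
```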

Time Series Decomposition

# Time series decomposition
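# Monthly series; the start date (January 2010) matches the analysis window,
# so plots show calendar years on the time axis
ts_data <- ts(vehicle_sales$vehicle_sales, start = c(2010, 1), frequency = 12)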
decomposed <- decompose(ts_data)

# Plot decomposed time series
plot(decomposed)

Based on the decomposition, the seasonal component appears relatively weak compared with the trend.
This matches expectations for vehicle sales, which may be influenced by economic factors, promotions, and policy changes, rather than strict seasonal patterns like retail sales or temperature-based industries.
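
A visual judgment of seasonal strength can be backed by a number. One commonly used measure is 1 - Var(remainder) / Var(seasonal + remainder), floored at zero, where values near 1 indicate strong seasonality and values near 0 indicate weak seasonality. The sketch below computes it on a synthetic series (not the real data); applying it to the actual decomposition is left as an assumption-free exercise.

```r
set.seed(1)  # reproducible toy example

# Toy monthly series (NOT the real data): trend + strong 12-month pattern + small noise
n <- 120
pattern <- c(-60, -40, 20, 10, 30, 25, 5, 15, -20, -10, -25, 50)
toy <- ts(1000 + 3 * seq_len(n) + rep(pattern, length.out = n) + rnorm(n, sd = 5),
          start = c(2010, 1), frequency = 12)

# Classical additive decomposition
parts <- decompose(toy)

# Strength of seasonality: 1 - Var(remainder) / Var(seasonal + remainder), floored at 0
remainder <- as.numeric(parts$random)
seasonal_plus_remainder <- as.numeric(parts$seasonal) + remainder
seasonal_strength <- max(0, 1 - var(remainder, na.rm = TRUE) /
                              var(seasonal_plus_remainder, na.rm = TRUE))
print(round(seasonal_strength, 3))
```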

Naive forecast

# Section 4: Naive Forecast
# Naive forecast for 6 time periods
naive_forecast <- naive(ts_data, h = 6)

# Plot naive forecast
autoplot(naive_forecast) +
  labs(title = "Naive Forecast for Vehicle Sales", x = "Time", y = "Vehicle Sales") +
  theme_minimal()

Naive forecast with drift

# Naive forecast with drift for 6 time periods
naive_drift_forecast <- rwf(ts_data, h = 6, drift = TRUE)

# Plot naive forecast with drift
autoplot(naive_drift_forecast) +
  labs(title = "Naive Forecast with Drift for Vehicle Sales", 
       x = "Time", 
       y = "Vehicle Sales") +
  theme_minimal()

The naive forecast with drift extends the average historical change (the slope of the line joining the first and last observations) into the future, which is consistent with the upward trend observed in vehicle sales. Unlike the plain naive forecast, which is flat, it carries that long-run increase forward.
The naive forecast with drift does not capture seasonality explicitly. If strong seasonal patterns were present (e.g., higher sales during certain months), this method would not adequately account for them.
The 80% and 95% prediction intervals widen as the forecast horizon increases, indicating growing uncertainty. This reflects real-world unpredictability in vehicle sales.
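
The arithmetic behind both methods is simple enough to reproduce by hand. The sketch below uses a four-point toy series (not the real data) and base R only. The estimate of `sigma` is an assumption of this sketch; forecasting packages may use a slightly different estimator, so the interval widths are indicative only.

```r
# Toy series (NOT the real data)
y <- c(10, 12, 15, 19)
n_obs <- length(y)
h <- 1:3  # forecast horizons

# Naive forecast: repeat the last observed value
naive_point <- rep(y[n_obs], length(h))

# Drift forecast: last value plus h times the average historical change,
# which equals the slope of the line joining the first and last observations
drift <- (y[n_obs] - y[1]) / (n_obs - 1)
drift_point <- y[n_obs] + h * drift

# Interval half-width for the drift method at the 95% level.
# sigma is estimated here as the standard deviation of the one-step residuals
# (an assumption of this sketch; packages may use a slightly different estimator).
residuals_drift <- diff(y) - drift
sigma <- sd(residuals_drift)
half_width_95 <- qnorm(0.975) * sigma * sqrt(h * (1 + h / (n_obs - 1)))

print(drift_point)
print(round(half_width_95, 2))
```

The half-widths grow with `h`, which is why the shaded bands in the forecast plots fan out.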

The naive forecast with drift is suitable for data with a long-term trend, as seen in vehicle sales. It captures the general direction of the time series better than the plain naive forecast. However, it does not model seasonality, which could be important for industries affected by time-based patterns, and it does not react to external shocks (e.g., economic downturns, policy changes, or supply chain disruptions).
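
For comparison, a seasonal naive forecast is one simple benchmark that does account for seasonality: each future month takes the value observed in the same month of the last available year. The sketch below implements it in base R on a synthetic series (not the real data); the forecast package also offers `snaive()` for this purpose.

```r
# Toy monthly series (NOT the real data): two years with a repeating pattern and growth
pattern <- c(90, 85, 110, 105, 115, 112, 100, 108, 95, 98, 92, 120)
y <- ts(c(pattern, pattern + 10), start = c(2018, 1), frequency = 12)

# Seasonal naive forecast: each future month takes the value observed
# in the same month of the last available year
h <- 6
period <- frequency(y)
last_year <- tail(as.numeric(y), period)
snaive_point <- last_year[((seq_len(h) - 1) %% period) + 1]

print(snaive_point)
```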