Total Vehicle Sales

Author

Moyuri Sarkar

DESCRIPTION

The Total Vehicle Sales is a data set retrived from the Federal Reserve Economic Data (FRED). FRED is an online database consisting of hundreds of thousands of economic data time series from scores of national, international, public, and private sources. The data was sourced from U.S. Bureau of Economic Analysis. It contains data about the number of units of vehicles (in thousands) sold every month. The data set is a time series comprising 587 monthly observations of vehicle sales, spanning from January 1976 till November 2024. The seasonality is not adjusted. For our analysis, we will use the data ranging from January 2010 to December 2019.

The variation in vehicle sales over time is likely driven by factors such as:

Economic Cycles
Seasonality
Policy Changes
External Shocks

Periods of economic growth or recession significantly impact consumer purchasing power. Sales may peak during certain months due to holidays, end-of-year promotions, or new model releases. Tax incentives or regulations on emissions could affect vehicle demand. Events like oil price fluctuations, pandemics, or supply chain disruptions can cause sudden changes.

Forecasting this data set might be manageable due to its structure and historical length. However, incorporating external data like economic indicators or policy changes could significantly improve accuracy. Without such context, the model may struggle with unexpected variability, especially during periods of significant market disruption.

Descriptive Analysis: Data Summary and Distribution

library(tsibble)

Warning: package 'tsibble' was built under R version 4.4.2

Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr


Attaching package: 'tsibble'

The following objects are masked from 'package:base':

    intersect, setdiff, union

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(scales)
library(lubridate)


Attaching package: 'lubridate'

The following object is masked from 'package:tsibble':

    interval

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(ggpubr)

Warning: package 'ggpubr' was built under R version 4.4.2

library(kableExtra)

Warning: package 'kableExtra' was built under R version 4.4.2


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

library(feasts)

Warning: package 'feasts' was built under R version 4.4.2

Loading required package: fabletools

Warning: package 'fabletools' was built under R version 4.4.2

library(fable)

Warning: package 'fable' was built under R version 4.4.2

# Load the dataset
data <- read.csv("total_vehicle_sales.csv")

# Filter data to a specific time range (2010-2019) and convert to tsibble
data <- data %>%
  mutate(DATE = as.Date(date), vehicle_sales = vehicle_sales * 1000) %>%  # Assuming sales are in thousands
  filter(DATE >= "2010-01-01" & DATE <= "2019-12-01") %>%
  as_tsibble(index = DATE)

# Split into training and testing sets
train <- data %>%
  filter(DATE < ymd("2018-01-01"))

test <- data %>%
  filter(DATE >= ymd("2018-01-01"))

# Plot the general trend of vehicle sales over time
data_line <- train %>%
  ggplot() +
  geom_line(aes(DATE, vehicle_sales), color = "black") +
  theme_bw() +
  xlab("Date") +
  scale_y_continuous(
    name = "Vehicle Sales",
    labels = comma
  ) +
  ggtitle("Line Chart - General Trend of Vehicle Sales Over the Years") +
  labs(caption = "Figure 1") +
  theme(plot.title = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0.5)) +
  geom_smooth(aes(DATE, vehicle_sales), method = "lm", color = "red")

# Display the plot
data_line

`geom_smooth()` using formula = 'y ~ x'

It can be observed that there is a steady increase in the sales of vehicles over the years.
The monthly vehicle sales exhibits regular peaks and troughs, indicating a seasonal pattern. This could be due to factors like end-of-year promotions, holidays, or cyclical demand patterns.
While the general trend is upward, there are significant fluctuations in sales within each year.
Toward the end of the period (around 2017–2018), the upward momentum seems to moderate, with smaller increases in sales compared to earlier years.

data_hist <- train %>%
  ggplot() +
  geom_histogram(aes(vehicle_sales), bins = 30, fill = "steelblue", color = "black") +
  theme_bw() +
  xlab("Vehicle Sales") +
  scale_y_continuous(name = "Count") +
  scale_x_continuous(labels = scales::comma) +
  coord_flip()

data_dens <- train %>%
  ggplot() +
  geom_density(aes(vehicle_sales), fill = "steelblue", alpha = 0.7) +
  theme_bw() +
  xlab("Vehicle Sales") +
  scale_y_continuous(name = "Density") +
  scale_x_continuous(labels = scales::comma) +
  coord_flip()

data_violin <- train %>%
  ggplot() +
  geom_violin(aes("", vehicle_sales), fill = "steelblue", color = "black") +
  theme_bw() +
  xlab("x") +
  scale_y_continuous(name = "Vehicle Sales", labels = scales::comma)

data_boxplot <- train %>%
  ggplot() +
  geom_boxplot(aes("", vehicle_sales), fill = "steelblue", color = "black") +
  theme_bw() +
  xlab("x") +
  scale_y_continuous(name = "Vehicle Sales", labels = scales::comma)

graphs <- ggarrange(data_hist, data_dens, data_violin, data_boxplot + rremove("x.text"), ncol = 2, nrow = 2) 
annotate_figure(graphs, top = text_grob("Distribution of Vehicle Sales"), bottom = "Figure 2")

The vehicle sales data is moderately concentrated around 1,250,000 to 1,500,000.
There is symmetry in the distribution, with no apparent heavy tails or significant skewness.
The data does not indicate extreme outliers or anomalies.

# Define mode function
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Calculate statistics
stats <- data.frame(
  Metric = c(
    "No. of Observations",
    "Min.",
    "Max.",
    "Mean",
    "Median",
    "Mode",
    "Standard Deviation",
    "Range"
  ),
  Values = c(
    nrow(train),
    paste0(round(min(train$vehicle_sales, na.rm = TRUE), 2)),
    paste0(round(max(train$vehicle_sales, na.rm = TRUE), 2)),
    paste0(round(mean(train$vehicle_sales, na.rm = TRUE), 2)),
    paste0(round(median(train$vehicle_sales, na.rm = TRUE), 2)),
    paste0(getmode(train$vehicle_sales)),
    round(sd(train$vehicle_sales, na.rm = TRUE), 2),
    paste0(
      round(diff(range(train$vehicle_sales, na.rm = TRUE)), 2)
    )
  )
)

# Render the table
stats %>%
  kbl(caption = "Table 1: Vehicle Sales Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Table 1: Vehicle Sales Statistics
Metric	Values
No. of Observations	96
Min.	712469
Max.	1719627
Mean	1308788.64
Median	1329975
Mode	712469
Standard Deviation	226186.16
Range	1007158

Moving Average

# Calculate moving average (12-month window)
data <- data %>%
  mutate(moving_average = slider::slide_dbl(vehicle_sales, mean, .before = 5, .after = 6, .complete = TRUE))

# Plot the moving average
data_ma_plot <- data %>%
  ggplot() +
  geom_line(aes(DATE, vehicle_sales), color = "steelblue", alpha = 0.6, lwd = 1) +
  geom_line(aes(DATE, moving_average), color = "red", lwd = 1.2) +
  theme_minimal() +
  xlab("Date") +
  scale_y_continuous(
    name = "Vehicle Sales",
    labels = comma
  ) +
  ggtitle("Moving Average of Vehicle Sales (2010-2019)") +
  theme(plot.title = element_text(hjust = 0.5))

# Display the moving average plot
data_ma_plot

Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_line()`).

There is a steady growth in vehicle sales as per the 12 month moving average (red line).
After 2016, the moving average appears flatten suggesting that the vehicles sales might have reached a saturation point or is experiencing external factors as economical slowdown, policy changes or consumer preferences.

Remainder

# Calculate remainder series (original - moving average)
data <- data %>%
  mutate(remainder = vehicle_sales - moving_average)

# Plot the remainder series
data_remainder_plot <- data %>%
  ggplot() +
  geom_line(aes(DATE, remainder), color = "steelblue", lwd = 1) +
  theme_minimal() +
  xlab("Date") +
  ylab("Remainder") +
  ggtitle("Remainder Series (Original - Moving Average)") +
  theme(plot.title = element_text(hjust = 0.5))

# Display the remainder plot
data_remainder_plot

Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_line()`).

The remainder series shows significant fluctuations, indicating that the moving average does not fully capture short-term variations in vehicle sales.
The recurring peaks and troughs suggest a seasonal pattern, meaning the moving average alone may not be sufficient to model the seasonal component.
Some extreme spikes indicate possible outliers or unexpected events affecting vehicle sales, such as economic downturns, policy changes, or supply chain disruptions.
The residuals do not appear completely random, implying that there might be additional underlying structures, such as cyclical trends or external factors influencing sales.

Time Series Decomposition

library(zoo)


Attaching package: 'zoo'

The following object is masked from 'package:tsibble':

    index

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

library(tidyr)
# Perform time series decomposition
data_decomp <- data %>%
  mutate(
    vehicle_sales = vehicle_sales / 1000,  # Scaling down for analysis
    ma_13_center = rollapply(
      vehicle_sales * 1000,
      13,
      FUN = mean,
      align = "center", fill = NA
    )
  ) %>%
  mutate(
    resid = vehicle_sales * 1000 - ma_13_center
  ) %>%
  select(DATE, vehicle_sales, ma_13_center, resid)

data_decomp$vehicle_sales <- data_decomp$vehicle_sales * 1000

# Prepare data for plotting
data_decomp_plot <- data_decomp %>%
  pivot_longer(
    vehicle_sales:resid,
    names_to = "decomposition",
    values_to = "value"
  ) %>%
  mutate(
    decomposition = case_when(
      decomposition == "vehicle_sales" ~ "Vehicle Sales",
      decomposition == "ma_13_center" ~ "Trend",
      decomposition == "resid" ~ "Remainder"
    )
  ) %>%
  mutate(
    decomposition = factor(
      decomposition,
      labels = c(
        "Vehicle Sales",
        "Trend",
        "Remainder"
      ),
      levels = c(
        "Vehicle Sales",
        "Trend",
        "Remainder"
      )
    )
  ) %>%
  ggplot() +
  geom_line(aes(DATE, value), size = 1) +
  facet_wrap(
    ~decomposition,
    nrow = 3,
    scales = "free"
  ) +
  theme_bw() +
  ylab("") +
  xlab("Date") +
  ggtitle("Vehicle Sales = Trend + Remainder") +
  labs(caption = "Figure 4") +
  theme(plot.title = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0.5))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

# Display the decomposition plot
data_decomp_plot

Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_line()`).

Based on the decomposition, seasonality does not appear to be strong, and if present, it is relatively weak.
This matches expectations for vehicle sales, which may be influenced by economic factors, promotions, and policy changes, rather than strict seasonal patterns like retail sales or temperature-based industries.

Naive forecast

# Naive Forecast (6 time periods ahead)
train <- data %>% filter(DATE < ymd("2019-07-01"))

naive_forecast <- train %>%
  model(naive_model = NAIVE(vehicle_sales)) %>%
  forecast(h = 6)

Warning: 1 error encountered for naive_model
[1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.

# Plot naive forecast
naive_forecast_plot <- naive_forecast %>%
  autoplot(data) +
  ggtitle("6-Period Naive Forecast for Vehicle Sales") +
  xlab("Date") +
  ylab("Vehicle Sales") +
  theme_bw()

# Display the naive forecast plot
naive_forecast_plot

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_line()`).

Naive forecast with drift

# Naive Forecast with Drift (6 time periods ahead)
train <- data %>% filter(DATE < ymd("2019-07-01"))

naive_drift_forecast <- train %>%
  model(naive_drift_model = RW(vehicle_sales ~ drift())) %>%
  forecast(h = 6)

Warning: 1 error encountered for naive_drift_model
[1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.

# Plot naive forecast with drift
naive_drift_forecast_plot <- naive_drift_forecast %>%
  autoplot(data) +
  ggtitle("6-Period Naive Forecast with Drift for Vehicle Sales") +
  xlab("Date") +
  ylab("Vehicle Sales") +
  theme_bw()

# Display the naive forecast with drift plot
naive_drift_forecast_plot

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_line()`).

The naive forecast with drift extends the most recent trend into the future, which aligns with the historical pattern observed in vehicle sales. This method incorporates the natural increase in sales over time rather than assuming a flat forecast.
The naive forecast with drift does not capture seasonality explicitly. If strong seasonal patterns were present (e.g., higher sales during certain months), this method would not adequately account for that.
The 80% and 95% confidence intervals widen as we move forward in time, indicating increasing uncertainty in predictions. This reflects real-world unpredictability in vehicle sales.

Naive Forecast with drift is suitable for data with a long-term trend (as seen in vehicle sales). It captures the general direction of the time series better than a naive (flat) forecast. However, it does not model seasonality, which could be important for industries affected by time-based patterns. Also, it does not react to external shocks (e.g., economic downturns, policy changes, or supply chain disruptions).