Assignment 2

Author

Farzana

1 Section 1: Dataset Overview and Context

1.1 Overview

The dataset tracks monthly total vehicle sales in the United States from 1976 to 2024: 587 monthly observations, which with the January 1976 start used in the code runs through November 2024. The data shows clear seasonal patterns with peaks typically occurring in March-June and troughs in January-February. Vehicle sales are influenced by economic conditions (GDP, interest rates, employment), consumer confidence, manufacturer incentives, and seasonal buying patterns. The series exhibits both cyclical behavior aligned with economic cycles and structural changes like the 2008 financial crisis and 2020 pandemic disruptions.

Forecasting vehicle sales presents moderate difficulty. While strong seasonality and economic relationships provide useful signals, the series is subject to unpredictable shocks (oil prices, supply chain disruptions) and changing consumer preferences. Long-term forecasting is complicated by industry transformations like the shift toward electric vehicles and evolving mobility trends.

2 Section 2: Exploratory Data Analysis

2.1 Time Series Visualization

Show code

library(dplyr)
library(ggplot2)
library(tidyr)
library(kableExtra)
library(zoo)
library(forecast)
library(tseries)
library(lubridate)

vehicle_sales <- read.csv("total_vehicle_sales.csv")
vehicle_sales$date <- as.Date(vehicle_sales$date)

# time series object
vehicle_ts <- ts(vehicle_sales$vehicle_sales, start = c(1976, 1), frequency = 12)

# Basic time series plot
ggplot(vehicle_sales, aes(x = date, y = vehicle_sales)) +
  geom_line(color = "#2C3E50") +
  labs(title = "US Vehicle Sales Over Time",
       x = "Year",
       y = "Total Sales",
       caption = "Source: Your Data Source") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
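
Because ts() ignores the date column and builds its own index from start and frequency, it can be worth confirming that the two agree. The following is a minimal base-R sketch that uses a placeholder series of the same length (the 587 observations reported in the summary statistics) instead of the real data; toy_ts, first_period and last_period are hypothetical names.

```r
# Placeholder series with the same length as the sales data (587 months).
# The values are arbitrary; only the time index matters here.
toy_ts <- ts(seq_len(587), start = c(1976, 1), frequency = 12)

first_period <- start(toy_ts)  # c(year, month) of the first observation
last_period  <- end(toy_ts)    # c(year, month) of the last observation

print(first_period)
print(last_period)
```

With the real data, end(vehicle_ts) should match the year and month of max(vehicle_sales$date); a mismatch would point to missing or duplicated months in the CSV.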

Show code

# Density plot
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
  geom_density(fill = "#3498DB", alpha = 0.7) +
  labs(title = "Distribution of Vehicle Sales",
       x = "Sales Volume",
       y = "Density") +
  theme_minimal()

Show code

# Monthly boxplot
vehicle_sales$month <- format(vehicle_sales$date, "%b")
vehicle_sales$month <- factor(vehicle_sales$month, levels = month.abb)

ggplot(vehicle_sales, aes(x = month, y = vehicle_sales)) +
  geom_boxplot(fill = "#3498DB", alpha = 0.7) +
  labs(title = "Monthly Distribution of Vehicle Sales",
       x = "Month",
       y = "Sales Volume") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45))

Show code

# Calculate summary statistics
summary_stats <- data.frame(
  Metric = c("Number of Observations",
             "Mean",
             "Median",
             "Standard Deviation",
             "Minimum",
             "Maximum",
             "1st Quartile",
             "3rd Quartile"),
  Value = c(length(vehicle_ts),
            mean(vehicle_ts),
            median(vehicle_ts),
            sd(vehicle_ts),
            min(vehicle_ts),
            max(vehicle_ts),
            quantile(vehicle_ts, 0.25),
            quantile(vehicle_ts, 0.75))
)

# Create formatted table
kable(summary_stats, 
      caption = "Summary Statistics of Vehicle Sales",
      format = "html",
      digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Summary Statistics of Vehicle Sales
Metric	Value
Number of Observations	587.00
Mean	1263.82
Median	1271.20
Standard Deviation	220.79
Minimum	670.47
Maximum	1845.71
1st Quartile	1119.52
3rd Quartile	1421.77

Central Tendency: The mean (1263.82) and median (1271.20) differ by less than 1%, which suggests a roughly symmetric distribution without pronounced skew; the mean sitting slightly below the median hints at a mild left tail. This describes the distribution of monthly values, not their stability over time, so it should be read alongside the time series plot.
Variability: A standard deviation of 220.79 is about 17% of the mean, indicating moderate variation. If the distribution is roughly bell-shaped, about two-thirds of months would fall within one standard deviation of the mean (roughly 1043 to 1485).
Range: The minimum of 670.47 and the maximum of 1845.71 span a range of 1175.24. The low end likely reflects challenging economic times or off-peak seasons, while the high end likely reflects high-demand seasons or favorable economic conditions.
Quartiles: The interquartile range (IQR = 1421.77 - 1119.52 = 302.25) reflects moderate variability in the middle 50% of the sales data. It describes the spread of typical monthly sales levels and is not affected by extreme values or unusual events.

Outliers: The most extreme values, near the minimum and maximum, are likely tied to major economic events, such as financial crises or periods of economic boom.
Seasonality: Summary statistics alone cannot reveal seasonal or cyclical patterns, because they ignore the ordering of the observations. The monthly boxplot is the better evidence here, since it compares the distribution of sales across calendar months.
Challenges in Forecasting: The variability indicated by the standard deviation, along with occasional extremes in the data, makes forecasting somewhat challenging, particularly when factoring in external shocks.
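
The statements about variability and extreme values can be checked with a little arithmetic on the reported statistics. The sketch below assumes the conventional 1.5 x IQR rule for flagging outliers (a common convention, not something the report itself specifies) and copies its inputs from the summary table; all variable names are illustrative.

```r
# Values copied from the summary statistics table above.
mean_sales <- 1263.82
sd_sales   <- 220.79
min_sales  <- 670.47
max_sales  <- 1845.71
q1 <- 1119.52
q3 <- 1421.77

# Coefficient of variation: standard deviation relative to the mean.
cv <- sd_sales / mean_sales

# Conventional outlier fences: 1.5 IQRs beyond the quartiles.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE means at least one observation falls outside that fence.
any_low_outlier  <- min_sales < lower_fence
any_high_outlier <- max_sales > upper_fence

print(round(c(cv = cv, iqr = iqr, lower = lower_fence, upper = upper_fence), 3))
print(c(low = any_low_outlier, high = any_high_outlier))
```

Under this rule neither the minimum nor the maximum is flagged, so the extremes are large but not outliers in the boxplot sense when all months are pooled.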

3 Section 3: Time Series Components Analysis

3.1 Moving Average Analysis

Show code

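# Calculate 12-month moving average.
# A 12-term window yields n - 11 values; the indices below pair them with
# observations 6..(n - 6), matching zoo's "center" alignment for an even window.
ma_12 <- rollmean(vehicle_ts, k = 12, align = "center")

# Create data frame for plotting
n_obs <- length(vehicle_ts)
ma_df <- data.frame(
  date = vehicle_sales$date[6:(n_obs - 6)],
  original = as.numeric(vehicle_ts[6:(n_obs - 6)]),
  ma = as.numeric(ma_12)  # drop ts attributes so ggplot2 sees plain numbers
)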

# Plot original series with moving average
ggplot(ma_df, aes(x = date)) +
  geom_line(aes(y = original, color = "Original"), alpha = 0.7) +
  geom_line(aes(y = ma, color = "12-Month Moving Average"), linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_color_manual(values = c("Original" = "#2C3E50", 
                               "12-Month Moving Average" = "#E74C3C")) +
  labs(title = "Vehicle Sales with 12-Month Moving Average",
       x = "Year",
       y = "Sales Volume",
       color = "Series") +
  theme_minimal()

Show code

# Calculate and plot remainder series
ma_df$remainder <- ma_df$original - ma_df$ma

ggplot(ma_df, aes(x = date, y = remainder)) +
  geom_line(color = "#2C3E50") +
  labs(title = "Remainder Series (Original - Moving Average)",
       x = "Year",
       y = "Remainder") +
  theme_minimal()
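
A 12-term window has no middle observation, so the average above sits half a month off-center. A common alternative is the 2x12 moving average, which gives half weight to the two end months. The sketch below is an illustration on synthetic data (not the sales series) and uses only base R; toy_ts, seasonal_pattern and weights_2x12 are hypothetical names. It shows that this filter removes a fixed 12-month seasonal pattern and returns a linear trend unchanged.

```r
# Synthetic monthly series: linear trend plus a fixed seasonal pattern that
# sums to zero over each year (illustrative values, not the sales data).
seasonal_pattern <- c(-3, -2, 2, 3, 2, 1, 0, 0, -1, 0, -1, -1)
n <- 120
trend <- 100 + 0.5 * seq_len(n)
toy_ts <- ts(trend + rep(seasonal_pattern, length.out = n),
             start = c(2000, 1), frequency = 12)

# 2x12 weights: half weight on the two end months, full weight on the 11 between.
weights_2x12 <- c(0.5, rep(1, 11), 0.5) / 12

# stats::filter is called explicitly because dplyr::filter masks it once
# dplyr is attached, as it is in this report.
ma_2x12 <- stats::filter(toy_ts, filter = weights_2x12, sides = 2)

# The first and last 6 values are NA; elsewhere the average equals the trend.
max_abs_gap <- max(abs(ma_2x12 - trend), na.rm = TRUE)
print(max_abs_gap)
```

For the report's data, forecast::ma(vehicle_ts, order = 12, centre = TRUE) should give the same kind of centered average, and decompose() applies these weights when it estimates the trend of a monthly series.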

3.2 Seasonality Analysis

Show code

# Decompose time series
decomp <- decompose(vehicle_ts, type = "multiplicative")

# Plot decomposition
autoplot(decomp) +
  theme_minimal() +
  labs(title = "Multiplicative Time Series Decomposition")

Show code

# Seasonal plot
ggseasonplot(vehicle_ts, 
             year.labels = TRUE, 
             year.labels.left = TRUE) +
  theme_minimal() +
  labs(title = "Seasonal Plot of Vehicle Sales",
       x = "Month",
       y = "Sales Volume")
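
The seasonal panel of the decomposition plot repeats the same 12 indices every year, and reading them as numbers can be easier than reading the plot. The sketch below uses the built-in AirPassengers series as a stand-in for the sales data, and it assumes the series starts in January so that the first index belongs to January; decomp_demo and seasonal_index are hypothetical names.

```r
# AirPassengers ships with base R and stands in for the sales series here.
decomp_demo <- decompose(AirPassengers, type = "multiplicative")

# One seasonal index per calendar month; 1.10 would mean 10% above trend.
# The labels assume the series starts in January, as AirPassengers does.
seasonal_index <- setNames(decomp_demo$figure, month.abb)
print(round(seasonal_index, 3))

# Months with the highest and lowest seasonal index.
peak_month   <- names(which.max(seasonal_index))
trough_month <- names(which.min(seasonal_index))
print(c(peak = peak_month, trough = trough_month))
```

Applied to the report's object, the same lines with decomp$figure would list the monthly indices behind the claim of spring peaks and winter troughs.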

3.2.1 Observations

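Cyclicality: The moving average highlights the trend-cycle, including multi-year swings that line up with economic expansions and downturns.
Seasonality: A 12-month moving average averages out the within-year pattern, so seasonality shows up in the remainder series, the seasonal component of the decomposition, and the seasonal plot rather than in the moving average itself. Its strength would still need to be quantified to confirm how dominant it is.
Forecasting Challenges: The remainder series also contains irregular movements, which suggests that unexpected short-term events, such as shocks or anomalies, could pose challenges for accurate forecasting.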
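
One way to put a number on how strong the seasonality is, offered here as a sketch and not as the method used in this report, is the variance-based strength measure computed from an STL decomposition. The example runs on the built-in AirPassengers series as a stand-in for the sales data; seasonal_strength is a hypothetical name.

```r
# AirPassengers (built into R) stands in for the sales series; the log makes
# the multiplicative pattern additive, which is what stl() assumes.
fit <- stl(log(AirPassengers), s.window = "periodic")
components <- fit$time.series

seasonal  <- as.numeric(components[, "seasonal"])
remainder <- as.numeric(components[, "remainder"])

# Strength of seasonality: 1 - Var(remainder) / Var(seasonal + remainder),
# floored at 0. Values near 1 mean the seasonal component dominates the noise.
seasonal_strength <- max(0, 1 - var(remainder) / var(seasonal + remainder))
print(round(seasonal_strength, 3))
```

Running the same steps on log(vehicle_ts) would give a comparable figure for the sales series.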

4 Section 4: Naive Forecasting

4.1 Forecasting Results

Show code

# Create seasonal naive forecast
forecast_length <- 6
snaive_forecast <- snaive(vehicle_ts, h = forecast_length)

# Plot forecast
autoplot(snaive_forecast) +
  theme_minimal() +
  labs(title = "6-Period Seasonal Naive Forecast",
       x = "Year",
       y = "Sales Volume") +
  guides(colour = guide_legend(title = "Series"))
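
The seasonal naive method is simple enough to reproduce by hand, which makes it clear where each forecast value comes from. The sketch below is a base-R illustration of the point forecasts only (no prediction intervals); snaive_point is a hypothetical helper, not a function from the forecast package, and the series is a placeholder whose values equal their own positions.

```r
# Seasonal naive point forecasts: each horizon h reuses the observation from
# the same season in the last full cycle of the data.
snaive_point <- function(x, h) {
  m <- frequency(x)  # season length (12 for monthly data)
  n <- length(x)
  idx <- n - m + ((seq_len(h) - 1) %% m) + 1
  as.numeric(x[idx])
}

# Placeholder series whose values equal their own position (1..587), so the
# output shows which observations get reused.
toy_ts <- ts(seq_len(587), start = c(1976, 1), frequency = 12)
reused_positions <- snaive_point(toy_ts, h = 6)
print(reused_positions)
```

For a 587-month series starting in January 1976, the six forecasts reuse observations 576 to 581, that is, December 2023 through May 2024.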

Show code

# Calculate accuracy metrics
accuracy_metrics <- accuracy(snaive_forecast)
kable(accuracy_metrics, 
      caption = "Forecast Accuracy Metrics",
      format = "html",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Forecast Accuracy Metrics
	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1
Training set	5.285	157.667	114.406	-0.498	9.701	1	0.625
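
These training-set figures are computed from in-sample errors, where each error is a value minus the value 12 months earlier. The sketch below recomputes the main metrics in base R on the built-in AirPassengers series (a stand-in, so the numbers will differ from the table) and shows why the MASE of a seasonal naive model is 1 on its own training data: the scaling term is the same in-sample seasonal naive error.

```r
# AirPassengers (built into R) stands in for the sales series.
x <- AirPassengers
m <- frequency(x)

# In-sample seasonal naive errors: each value minus the value 12 months earlier.
errors <- as.numeric(diff(x, lag = m))
actual <- as.numeric(x)[-seq_len(m)]

rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))
mape <- mean(abs(errors / actual)) * 100

# MASE divides MAE by the in-sample seasonal naive MAE, which is the same
# quantity here, so the ratio is 1 by construction.
mase <- mae / mean(abs(diff(as.numeric(x), lag = m)))

print(round(c(RMSE = rmse, MAE = mae, MAPE = mape, MASE = mase), 3))
```

A MASE of 1 in the table therefore carries no information about forecast quality; it only becomes informative for other models or for out-of-sample data.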

Show code

# Compare with simple naive forecast
naive_forecast <- naive(vehicle_ts, h = forecast_length)
naive_accuracy <- accuracy(naive_forecast)

# Compare both approaches
forecast_comparison <- autoplot(vehicle_ts) +
  autolayer(naive_forecast, series = "Naive", PI = FALSE) +
  autolayer(snaive_forecast, series = "Seasonal Naive", PI = FALSE) +
  theme_minimal() +
  labs(title = "Comparison of Naive and Seasonal Naive Forecasts",
       x = "Year",
       y = "Sales Volume")

print(forecast_comparison)

4.1.1 Analysis of Forecasting Results

6-Period Seasonal Naive Forecast
- This plot displays the predicted sales for the next six periods using a seasonal naïve model.
- The model reproduces the recurring seasonal pattern by reusing the value observed in the same month of the previous year as each prediction.
- On the training data, the seasonal naïve errors give an RMSE of 157.667 and a MAPE of 9.701%. The MASE of 1 is expected, since MASE for a seasonal series is scaled by the in-sample seasonal naïve error. The forecast itself cannot yet be compared with observed values, so these figures describe in-sample fit rather than out-of-sample accuracy.
Comparison of Naive and Seasonal Naive Forecasts
- The naive forecast assumes the most recent data point remains constant, disregarding any seasonal effects.
- In contrast, the seasonal naïve forecast carries the previous year's seasonal pattern forward, so its predictions vary by month.
- This comparison suggests the seasonal naïve forecast gives more realistic short-term predictions for datasets with strong seasonality, although a holdout evaluation would be needed to confirm that it is also more accurate.
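
A holdout check is one way to test the comparison between the two methods. The sketch below is a base-R illustration on the built-in AirPassengers series, used as a stand-in for the sales data: it withholds the last six months, forecasts them with both methods, and compares mean absolute errors. The names train, actual and holdout_mae are hypothetical, and the result on AirPassengers says nothing definitive about the vehicle sales series.

```r
# AirPassengers (built into R) stands in for the sales series.
x <- AirPassengers
h <- 6
n <- length(x)
m <- frequency(x)

# Split: everything except the last h months is training data.
train   <- as.numeric(x)[seq_len(n - h)]
actual  <- as.numeric(x)[(n - h + 1):n]
n_train <- length(train)

# Naive: repeat the last training value. Seasonal naive: reuse the values
# from the same months one year before each forecast month.
naive_fc  <- rep(train[n_train], h)
snaive_fc <- train[n_train - m + seq_len(h)]

mae <- function(forecast, observed) mean(abs(observed - forecast))
holdout_mae <- c(naive = mae(naive_fc, actual),
                 seasonal_naive = mae(snaive_fc, actual))
print(round(holdout_mae, 2))
```

With the forecast package, the equivalent check would fit snaive() and naive() on a training window of vehicle_ts and pass the withheld months as the second argument to accuracy().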
