BANA 7050: Assignment 2_Keerthi Chereddy

Section 1

The dataset is the total vehicle sales series for the USA (“TOTALNSA”), published on FRED by the Federal Reserve Bank of St. Louis. This monthly time series is a widely watched economic indicator that reflects consumer spending and conditions in the automotive industry. With 575 monthly observations (roughly 48 years), it gives a long-run view of vehicle sales dynamics.

Vehicle sales are closely tied to macroeconomic conditions, so fluctuations in income levels, economic policy, and similar factors can cause significant variability in the data. Government policies, environmental regulations, shifts in consumer preferences, and new technologies (such as electric vehicles) can also directly affect the industry. The data also exhibits seasonality; for example, year-end months traditionally show higher sales.

The combined effect of these factors, and the interplay among them, makes this time series challenging to forecast.

Section 2

library(ggplot2)  # ggplot(), geom_*(), labs(), theme_minimal()

# Line chart
ggplot(vehicle_sales, aes(x = date, y = vehicle_sales, group = 1)) +
  geom_line() +
  labs(title = "Time Series of Vehicle Sales", x = "Date", y = "Total Sales")

Sales appear relatively stable over time, with no major upward or downward trend. A clear pattern of peaks and troughs suggests seasonality in the data. A few outliers (spikes and dips) point to extraordinary events that significantly affected vehicle sales, such as economic shocks, policy changes, or supply chain disruptions.
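One way to check the seasonality reading from the line chart is to average the series by calendar month and compare each month with the overall mean. The sketch below uses a synthetic monthly series (the seasonal pattern, level, and noise are made-up assumptions, not the TOTALNSA data), so only the technique carries over.

```r
# Synthetic monthly series (NOT the TOTALNSA data): level + seasonal pattern + noise
set.seed(1)
n_years <- 10
seasonal_pattern <- c(-150, -100, 50, 20, 80, 60, 30, 40, -20, -40, -60, 90)
sales <- ts(1250 + rep(seasonal_pattern, n_years) + rnorm(12 * n_years, sd = 40),
            start = c(2010, 1), frequency = 12)

# cycle() gives the position in the year (1-12) of each observation
month_index <- as.integer(cycle(sales))

# Mean sales for each calendar month
monthly_means <- tapply(as.numeric(sales), month_index, mean)
names(monthly_means) <- month.abb

# Deviation of each month from the overall mean; large, stable gaps suggest seasonality
monthly_deviation <- monthly_means - mean(sales)
round(monthly_deviation, 1)
```

Applied to the real data, the same calculation would use the `vehicle_sales` column grouped by the month of `date`.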

# Density Plot
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Density Plot of Vehicle Sales", x = "Total Sales", y = "Density") +
  theme_minimal()

The density plot for vehicle sales shows a unimodal distribution with a peak near the middle of the sales range. This indicates that sales tend to cluster around a central value, with fewer occurrences of extremely low or high sales. The symmetry of the plot suggests that the distribution is relatively balanced, without a significant skew in either direction.

# Histogram
ggplot(vehicle_sales, aes(x = vehicle_sales)) +
  geom_histogram(binwidth = 50, fill = "blue", color = "black") + # Adjust binwidth as needed
  labs(title = "Histogram of Vehicle Sales", x = "Total Sales", y = "Count") +
  theme_minimal()

With a bin width of 50, the histogram of vehicle sales shows several peaks, a multimodal shape that the smoother density plot above does not reveal. This suggests the presence of multiple sales behaviors within the time series. It also points to potential underlying factors influencing sales volumes at different times.
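Because the number of visible peaks depends on the bin width, it can help to compare the chosen width with a rule of thumb. The sketch below applies the Freedman-Diaconis rule (bin width = 2 × IQR / n^(1/3)) to the count and IQR reported in the summary statistics later in this section. It is an illustrative check, not part of the original analysis.

```r
# Values taken from the summary statistics output in this section
n   <- 575       # Count
iqr <- 302.971   # IQR

# Freedman-Diaconis rule of thumb for histogram bin width
fd_binwidth <- 2 * iqr / n^(1/3)
round(fd_binwidth, 1)
```

If the suggested width is noticeably wider than the 50 used in the histogram, some of the peaks may be artifacts of narrow bins rather than distinct sales regimes.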

# Boxplot
ggplot(vehicle_sales, aes(y = vehicle_sales)) +
  geom_boxplot(fill = "blue", color = "black") +
  labs(title = "Boxplot of Vehicle Sales", x = "", y = "Total Sales") +
  theme_minimal()

The boxplot for vehicle sales shows a relatively concentrated interquartile range: the middle 50% of the sales data falls within a narrow band of values. The long whiskers on both ends show that some periods had much lower or much higher sales, but such values are rare compared with the bulk of the data. The symmetry of the box and whiskers indicates an approximately even distribution of data around the median, without a pronounced skew.
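For reference, `geom_boxplot()` by default draws whiskers out to the most extreme points within 1.5 × IQR of the box and plots anything beyond that as individual outlier points. The sketch below reproduces that rule by hand on synthetic values (made up for illustration, with two extremes planted on purpose), using `quantile()` as an approximation of the boxplot hinges.

```r
# Synthetic sample (NOT the TOTALNSA data) with two planted extreme values
set.seed(42)
x <- c(rnorm(200, mean = 1260, sd = 220), 300, 2400)

# Quartiles and the 1.5 * IQR fences used to flag outliers
q <- quantile(x, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(x)
fence_high <- q[[2]] + 1.5 * IQR(x)

# Points outside the fences would be drawn as individual dots on a boxplot
outliers <- x[x < fence_low | x > fence_high]
sort(outliers)
```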

library(dplyr)  # %>%, summarise(), arrange(), mutate()

# Summary Statistics
summary_stats <- vehicle_sales %>%
  summarise(
    Count = n(),
    Mean = mean(vehicle_sales, na.rm = TRUE),
    Median = median(vehicle_sales, na.rm = TRUE),
    Mode = as.numeric(names(sort(table(vehicle_sales), decreasing = TRUE)[1])), # Most frequent exact value; rarely meaningful for continuous data
    SD = sd(vehicle_sales, na.rm = TRUE),
    Range = max(vehicle_sales, na.rm = TRUE) - min(vehicle_sales, na.rm = TRUE),
    IQR = IQR(vehicle_sales, na.rm = TRUE)
  )
summary_stats

##   Count     Mean   Median  Mode       SD    Range     IQR
## 1   575 1261.744 1268.455 839.3 221.9962 1175.247 302.971

The vehicle sales time series shows seasonality, indicated by regular peaks and troughs, and volatility, as evidenced by a standard deviation of about 222 (roughly 18% of the mean) and the extreme values seen in the boxplot. The histogram suggests a multimodal distribution, which may reflect different sales patterns or regimes, while the close agreement between the mean (1261.7) and the median (1268.5) suggests an approximately symmetric distribution around a central level of sales. These characteristics imply that while seasonal patterns may be predictable, economic events and policy changes make forecasting more complex.
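To put the volatility and symmetry statements in relative terms, the reported summary statistics can be turned into a coefficient of variation and a mean-median gap measured in standard deviations. This is a small sketch using only the numbers printed above.

```r
# Values copied from the summary statistics output above
mean_sales   <- 1261.744
median_sales <- 1268.455
sd_sales     <- 221.9962

# Coefficient of variation: spread relative to the typical level
cv <- sd_sales / mean_sales

# Gap between mean and median in units of standard deviation;
# values near zero are consistent with a roughly symmetric distribution
gap_in_sd <- (mean_sales - median_sales) / sd_sales

round(c(cv = cv, gap_in_sd = gap_in_sd), 3)
```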

Section 3

library(zoo)  # rollmean()

# Calculate the moving average
window_size <- 12  # for a 12-month moving average
vehicle_sales <- vehicle_sales %>%
  arrange(date) %>%  # Make sure the data is sorted by date
  mutate(MovingAverage = rollmean(vehicle_sales, k = window_size, fill = NA, align = 'center'))

# Plot the time series with the moving average
ggplot(vehicle_sales, aes(x = date, group = 1)) +
  geom_line(aes(y = vehicle_sales), colour = "blue") +
  geom_line(aes(y = MovingAverage), colour = "red", na.rm = TRUE) +
  labs(title = "Vehicle Sales with 12-Month Moving Average", x = "Date", y = "Total Sales") +
  theme_minimal()

# Calculate the remainder series by subtracting the moving average from the original series
vehicle_sales$Remainder <- vehicle_sales$vehicle_sales - vehicle_sales$MovingAverage

# Plot the remainder series
ggplot(vehicle_sales, aes(x = date, group = 1)) +
  geom_line(aes(y = Remainder), colour = "green", na.rm = TRUE) +
  labs(title = "Remainder Series after Removing Moving Average", x = "Date", y = "Remainder") +
  theme_minimal()

Subtracting the moving average from the original time series leaves the fluctuations that the moving average smooths out. Because a 12-month window averages over a full seasonal cycle, the moving average mainly tracks the trend, so the remainder contains the within-year seasonal pattern along with short-term irregularities and noise.
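A 12-term window has no single middle observation, so a common alternative is the centered 2×12 moving average, which gives half weight to the two end months. The sketch below is an illustration in base R on a synthetic series (linear trend plus a seasonal pattern that sums to zero, both assumed for the example). It is a swapped-in variant, not the `rollmean()` call used above.

```r
# Synthetic series (NOT the TOTALNSA data): linear trend + zero-sum seasonal pattern
seasonal_pattern <- c(-150, -100, 50, 20, 80, 60, 30, 40, -20, -40, -60, 90)
t_idx    <- 1:120
trend    <- 1000 + 2 * t_idx
seasonal <- rep(seasonal_pattern, 10)
x <- ts(trend + seasonal, start = c(2010, 1), frequency = 12)

# 2x12 moving average: 13 weights, the two end months get half weight
w <- c(0.5, rep(1, 11), 0.5) / 12
ma_2x12 <- stats::filter(x, filter = w, sides = 2)  # NA for the first and last 6 months

# Remainder after removing the moving average
remainder <- x - ma_2x12

# On this noise-free series the moving average equals the trend,
# so the remainder is exactly the seasonal pattern
head(cbind(trend = trend, ma = ma_2x12, remainder = remainder), 12)
```

On noise-free data like this, the remainder reproduces the seasonal pattern exactly, which illustrates why the remainder of a 12-month moving average still carries the seasonality.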

Section 4

library(lubridate)  # year(), month()
library(forecast)   # autoplot() for stl objects, snaive(), rwf(), autolayer()

# Check that the vehicle_sales dataframe has the expected columns and is not empty
if ("vehicle_sales" %in% names(vehicle_sales) && "date" %in% names(vehicle_sales) && nrow(vehicle_sales) > 0) {
  # Assumes vehicle_sales$date is already of class Date and the rows are sorted by date
  first_date <- min(vehicle_sales$date)
  vehicle_sales_ts <- ts(vehicle_sales$vehicle_sales, frequency = 12,
                         start = c(year(first_date), month(first_date)))

  # Perform STL decomposition
  vehicle_sales_stl <- stl(vehicle_sales_ts, s.window = "periodic")

  # Plot the components
  autoplot(vehicle_sales_stl) + labs(title = "STL Decomposition of Vehicle Sales")
} else {
  stop("The 'vehicle_sales' and/or 'date' column is missing, or there is no data in the dataframe.")
}

The fluctuations in the trend line suggest changes in the underlying sales levels over the long term. The seasonal component shows a consistent repeating pattern each year, which suggests a strong seasonal influence. The amplitude of the seasonal component is quite noticeable, implying that the seasonality has a significant impact on sales.

The remainder, or residuals, are relatively small compared to the seasonal fluctuations, indicating that the STL decomposition has captured most of the systematic behavior in the data.
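The comparison between the remainder and the seasonal component can be summarized in a single number. One commonly used measure of seasonal strength is max(0, 1 − Var(remainder) / Var(seasonal + remainder)), where values near 1 indicate strong seasonality. The sketch below computes it from an `stl()` fit on a synthetic series; the data and seed are assumptions for illustration only.

```r
# Synthetic monthly series (NOT the TOTALNSA data)
set.seed(7)
seasonal_pattern <- c(-150, -100, 50, 20, 80, 60, 30, 40, -20, -40, -60, 90)
x <- ts(1250 + rep(seasonal_pattern, 10) + rnorm(120, sd = 40),
        start = c(2010, 1), frequency = 12)

# STL decomposition with a fixed (periodic) seasonal pattern
fit  <- stl(x, s.window = "periodic")
comp <- fit$time.series  # columns: "seasonal", "trend", "remainder"

# Seasonal strength: share of detrended variance explained by the seasonal component
seasonal_strength <- max(
  0,
  1 - var(comp[, "remainder"]) / var(comp[, "seasonal"] + comp[, "remainder"])
)
round(seasonal_strength, 2)
```

For the actual analysis, `fit` would be replaced by `vehicle_sales_stl`.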

# Create a naive seasonal forecast
naive_seasonal_forecast <- snaive(vehicle_sales_ts, h = 6)

# Create a naive forecast with drift
naive_drift_forecast <- rwf(vehicle_sales_ts, h = 6, drift = TRUE)

# Plot the original time series and the forecasts
autoplot(vehicle_sales_ts) +
  autolayer(naive_seasonal_forecast, series = "Naive Seasonal Forecast", PI = FALSE) +
  autolayer(naive_drift_forecast, series = "Naive Drift Forecast", PI = FALSE) +
  labs(title = "Naive Forecasts for Vehicle Sales", x = "Time", y = "Sales") +
  guides(colour=guide_legend(title="Legend")) +
  theme_minimal()

Because the analysis suggests strong seasonality, the seasonal naive forecast is expected to represent the behavior of the data well.
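That expectation can be checked by holding out the last six months and comparing forecast errors. The sketch below implements both benchmark methods by hand in base R on a synthetic series, so it runs without the forecast package. The data, seed, and the `mae()` helper are illustrative assumptions, not results for TOTALNSA.

```r
# Synthetic monthly series (NOT the TOTALNSA data)
set.seed(11)
seasonal_pattern <- c(-150, -100, 50, 20, 80, 60, 30, 40, -20, -40, -60, 90)
x <- 1250 + rep(seasonal_pattern, 10) + rnorm(120, sd = 15)

# Hold out the last h months for evaluation
h <- 6
n <- length(x)
train <- x[1:(n - h)]
test  <- x[(n - h + 1):n]
n_train <- length(train)

# Seasonal naive: each forecast is the value from the same month one year earlier
snaive_fc <- train[(n_train - 12 + 1):(n_train - 12 + h)]

# Naive with drift: last value plus the average historical change per step
slope    <- (train[n_train] - train[1]) / (n_train - 1)
drift_fc <- train[n_train] + slope * (1:h)

# Hypothetical helper: mean absolute error
mae <- function(actual, forecast) mean(abs(actual - forecast))

mae_snaive <- mae(test, snaive_fc)
mae_drift  <- mae(test, drift_fc)
c(seasonal_naive = mae_snaive, drift = mae_drift)
```

With the real data, the same split could be applied to `vehicle_sales_ts` before calling `snaive()` and `rwf()`, and the two methods compared on the held-out months.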

2024-01-22
