BANA 7050: Time Series Data Description, Exploratory Analysis, and Decomposition

Author

Saurabh Verma

Section 1

In one or two paragraphs, describe the time series dataset you have been assigned, including its source and how the data are published.

Discuss some potential explanations of the data-geneating process based on what you know. What drives variation in the variable? Why or why not do you think this might be an easy/difficult variable to forecast?

Read Libraries

Code

suppressWarnings({
  suppressMessages({
library(dplyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(forecast)
library(modeltime)
library(fabletools)
library(feasts)
library(tsibble)
library(tibble)
library(fable)


options(warn=-1)

})
})

Plot Avg Milk data

Code

milk_df <- read.csv('/Users/saurabhverma/Desktop/UC BANA/Spring 2024 Flex I/Forecasting/milk_price.csv')
#head(milk_df)
#str(milk_df)
milk_df$date <- ymd(milk_df$date)

ggplot(milk_df, aes(x = date, y = milk_price)) +
  geom_line(color = "#0099f9", size = 1.4) + 
  theme_classic() + 
  theme(
    axis.text = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18, face = "bold")
  ) + 
  labs(
    title = "Avg Milk Price Dataset",
    x = "Date",
    y = "Avg Milk Price"
  )

Data Description

The provided data describes the average cost ($) per gallon of fresh, whole, and fortified milk in U.S. cities. The data is sourced from the U.S. Bureau of Labor Statistics, indicating a comprehensive representation of pricing trends for this type of milk across various urban areas in the United States.

Link - https://fred.stlouisfed.org/series/APU0000709112#

The data timeline spans from July 1995 to November 2023, with monthly data points included in the dataset. This timeline covers a substantial period, allowing for the analysis of long-term trends and patterns in milk pricing across different months over almost three decades.

The average price of fresh, whole, and fortified milk in U.S. cities is influenced by factors such as supply and demand dynamics, production costs, market competition, economic conditions, seasonal variations, and government policies.

Forecasting this variable is challenging due to its complexity, sensitivity to external shocks, and the dynamic nature of the dairy market. Historical patterns and market research can aid in forecasting accuracy, but the multifaceted influences make it an intriguing yet complex variable to predict.

Section 2

In the next section, you should develop a number of visualizations that describe the behavior of your time-series dataset. These should include, but are not limited to, line charts, density plots, histograms, and boxplots. Each plot should be publication-quality with a title and informative labels. Each visualiation should be described in the text. The goal is to provide enough context that the reader understands the behavior of the time-series and its characteristics.

You will also create a publication-quality table with various summary statistics of the data, to include standard metrics such as the number of observations of the time series, its mean, median, and mode, as well as standard deviation, range, etc.

Complete an initial analysis of the data. Are there any clear outliers? Any obvious patterns that you can describe based on what you know about the data-generating process?

Code

# Line chart
ggplot(milk_df, aes(x = date, y = milk_price)) +
  geom_line(color = "blue") +
  labs(title = "Average Milk Prices Over Time",
       x = "Date",
       y = "Milk Price")

The line chart indicates a rising trend in the average milk prices over the years. Furthermore, there is a notable spike in the average milk prices in the more recent years.

Code

# Density plot
ggplot(milk_df, aes(x = milk_price)) +
  geom_density(fill = "skyblue", color = "blue") +
  labs(title = "Density Plot of Average Milk Prices",
       x = "Milk Price",
       y = "Density")

The distribution of the average price, as depicted by the density curve, exhibits a bimodal pattern with two distinct peaks. Additionally, there is a subtle right right skew, indicating that the mean value is greater than the median. This suggests a distribution with a tendency towards higher values and asymmetry towards the right side.

Code

# Histogram
ggplot(milk_df, aes(x = milk_price)) +
  geom_histogram(fill = "skyblue", color = "blue", bins = 20) +
  labs(title = "Histogram of Average Milk Prices",
       x = "Milk Price",
       y = "Frequency")

Over the majority of the observed period, the milk prices have generally hovered around the $2.8 range, occasionally experiencing increases to approximately $3.2 and $3.3. The distribution of prices appears to be sightly skewed towards right, indicating that higher prices occur less frequently while the majority of instances are concentrated around the lower values.

Code

# Boxplot
ggplot(milk_df, aes(y = milk_price)) +
  geom_boxplot(fill = "skyblue", color = "blue") +
  labs(title = "Boxplot of Average Milk Prices",
       x = "",
       y = "Milk Price")

In the box plot, no outliers are observed, and a slight rightward skewness is apparent.

Summary Statistics

Code

# Calculate summary statistics
summary_stats <- summary(milk_df$milk_price)

# Create a table
summary_table <- data.frame(
  Metric = c("Number of Observations", "Mean", "Median", "Mode", "Standard Deviation", "Range"),
  Value = c(length(milk_df$milk_price), mean(milk_df$milk_price), median(milk_df$milk_price),
            table(milk_df$milk_price)[which.max(table(milk_df$milk_price))],
            sd(milk_df$milk_price), diff(range(milk_df$milk_price)))
)

# Print the table
print(summary_table)

                  Metric       Value
1 Number of Observations 341.0000000
2                   Mean   3.2004428
3                 Median   3.1780000
4                   Mode   4.0000000
5     Standard Deviation   0.4241547
6                  Range   1.7590000

Outlier Detetction

Code

# Calculate Z-scores
z_scores <- scale(milk_df$milk_price)

# Identify outliers based on Z-scores
outliers <- abs(z_scores) > 2  # Adjust the threshold as needed

# Display the identified outliers
print(milk_df[outliers, ])

          date milk_price
323 2022-05-01      4.204
324 2022-06-01      4.153
325 2022-07-01      4.156
326 2022-08-01      4.194
327 2022-09-01      4.181
328 2022-10-01      4.184
329 2022-11-01      4.218
330 2022-12-01      4.211
331 2023-01-01      4.204
332 2023-02-01      4.163
333 2023-03-01      4.098

based on z score these 10 data points are the outliers

Section 3

Visualize a moving average of the time series variable and describe any insights gained from this. Then, visualize the “remainder” series calculated by subtracting the moving average from the original time series. Do there appear to be other patterns in the data that are not captured by the moving average?

Assess whether your time series contains seasonality using a time series decomposition and visualizations discussed in class. Does the seasonality appear to be strong, weak, or absent entirely? Does this match your expectations? Include visualizations as necessary, as well as a text explanation of your findings.

Code

suppressWarnings({
  suppressMessages({
# Function to calculate moving average
calculate_moving_average <- function(data, window_size) {
  data %>%
    mutate(moving_avg = zoo::rollmean(milk_price, k = window_size, fill = NA, align = "right"))
}

# Function to calculate remainder series
calculate_remainder <- function(data, window_size) {
  data %>%
    mutate(remainder = milk_price - zoo::rollmean(milk_price, k = window_size, fill = NA, align = "right"))
}

# Choose a window size
window_size <- 24  # Adjust as needed

# Calculate moving average and remainder
milk_df <- milk_df %>%
  calculate_moving_average(window_size) %>%
  calculate_remainder(window_size)

# Create a plot for moving average
moving_avg_plot <- ggplot(milk_df, aes(x = date, y = moving_avg)) +
  geom_line(color = "blue") +
  labs(title = paste("Moving Average (Window Size:", window_size, ")"),
       x = "Date",
       y = "Milk Price") +
  theme_minimal()

# Create a plot for remainder series
remainder_plot <-  ggplot(milk_df, aes(x = date, y = remainder)) +
  geom_line(color = "red") +
  labs(title = paste("Remainder Series (Original - Moving Average)"),
       x = "Date",
       y = "Remainder") +
  theme_minimal()

# Display both plots
grid.arrange(moving_avg_plot, remainder_plot, ncol = 1)

})
})

The remainder plot represents the seasonality component that remains unaccounted for by the moving average. To capture this seasonality, it is necessary to conduct time series decomposition.

Code

# Create a time series object
ts_data <- ts(milk_df$milk_price, frequency = 12, start = c(year(milk_df$date[1]), month(milk_df$date[1])))  

# Perform time series decomposition
decomposition <- stl(ts_data, s.window = "periodic")

# Plot the decomposition
plot(decomposition)

Code

# Extract components
trend <- decomposition$time.series[, "trend"]
seasonal <- decomposition$time.series[, "seasonal"]
remainder <- decomposition$time.series[, "remainder"]

# Create plots for each component
ggplot() +
  geom_line(aes(x = milk_df$date, y = trend), color = "blue") +
  labs(title = "Trend Component",
       x = "Date",
       y = "Trend") +
  theme_minimal()

Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.

Code

ggplot() +
  geom_line(aes(x = milk_df$date, y = seasonal), color = "green") +
  labs(title = "Seasonal Component",
       x = "Date",
       y = "Seasonal") +
  theme_minimal()

Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.

Code

ggplot() +
  geom_line(aes(x = milk_df$date, y = remainder), color = "red") +
  labs(title = "Remainder Component",
       x = "Date",
       y = "Remainder")

Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.

Following the time series decomposition, the analysis reveals that the seasonality component is effectively captured. Moreover, the seasonality component exhibits strength with a yearly period and possesses an additive nature.

Section 4

Finally, conduct a 6 time period naive forecast of your time series. If it contains significant seasonality, be sure to account for this. If you feel a naive forecast with drift is a better candidate, implement this approach and justify your decision. Does the naive forecast seem to do a good job of representing the behavior of the data? Why or why not? Include visualizations of the forecast and the original time series as appropriate.

Code

suppressWarnings({
    suppressMessages({

t_data <- milk_df %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)


t_data %>%
  model(
    Mean = MEAN(milk_price)
  ) %>%
  forecast(h = 12) %>%
  autoplot(t_data, level = NULL,size=1) +
  geom_vline(aes(xintercept = ymd("2023-12-01")), color = "red", linetype = "dashed") +
  theme_bw() +
  ylab('Avg Milk Price') + 
    ggtitle('Mean of series')


})
})

Code

t_data %>%
  model(
    Naive = NAIVE(milk_price)
  ) %>%
  forecast(h = 6) %>%
  autoplot(t_data, level = NULL,size=1) +
  geom_vline(aes(xintercept = ymd("2023-12-01")), color = "red", linetype = "dashed") +
  theme_bw()+
  ylab('Avg Milk Price') +
  ggtitle('Naive Forecast')

Code

suppressWarnings({
  suppressMessages({
drift_lm_data = t_data %>%
  filter(date == ymd(c('1995-07-01','2023-10-01')))

drift_lm = lm(data = drift_lm_data,milk_price~date)

drift_lm_pred = t_data %>%
  mutate(pred = predict(drift_lm,newdata=t_data))


t_data %>%
  model(
    Naive = NAIVE(milk_price~drift())
  ) %>%
  forecast(h = 6) %>%
  autoplot(milk_df, level = NULL,size=1,  title = 'Naive Forecast with Drift') +
  geom_vline(aes(xintercept = ymd("2023-11-01")), color = "red", linetype = "dashed") +
  geom_line(data=drift_lm_pred,aes(date,pred),color='blue',linetype='dashed')+
  theme_bw()+
  ylab('Avg Milk Pric') +
  ggtitle('Naive Forecast with Drift')

})
})

For the forecasting task, I conducted analyses using the Mean of the series, Naive, and Naive with drift methods. The Naive with drift approach exhibited a noteworthy forecasting outcome among the methods tested. However, it may not be the optimal approach for accurately representing the data’s behavior. Alternative methods such as SARIMA or ETS models could be more suitable, as they incorporate seasonality and trend components for forecasting. Currently, these components are not utilized, despite their presence in the data.