BANA 7050: Time Series Data Description, Exploratory Analysis, and Decomposition
Author
Saurabh Verma
Section 1
In one or two paragraphs, describe the time series dataset you have been assigned, including its source and how the data are published.
Discuss some potential explanations of the data-geneating process based on what you know. What drives variation in the variable? Why or why not do you think this might be an easy/difficult variable to forecast?
The provided data describes the average cost ($) per gallon of fresh, whole, and fortified milk in U.S. cities. The data is sourced from the U.S. Bureau of Labor Statistics, indicating a comprehensive representation of pricing trends for this type of milk across various urban areas in the United States.
The data timeline spans from July 1995 to November 2023, with monthly data points included in the dataset. This timeline covers a substantial period, allowing for the analysis of long-term trends and patterns in milk pricing across different months over almost three decades.
The average price of fresh, whole, and fortified milk in U.S. cities is influenced by factors such as supply and demand dynamics, production costs, market competition, economic conditions, seasonal variations, and government policies.
Forecasting this variable is challenging due to its complexity, sensitivity to external shocks, and the dynamic nature of the dairy market. Historical patterns and market research can aid in forecasting accuracy, but the multifaceted influences make it an intriguing yet complex variable to predict.
Section 2
In the next section, you should develop a number of visualizations that describe the behavior of your time-series dataset. These should include, but are not limited to, line charts, density plots, histograms, and boxplots. Each plot should be publication-quality with a title and informative labels. Each visualiation should be described in the text. The goal is to provide enough context that the reader understands the behavior of the time-series and its characteristics.
You will also create a publication-quality table with various summary statistics of the data, to include standard metrics such as the number of observations of the time series, its mean, median, and mode, as well as standard deviation, range, etc.
Complete an initial analysis of the data. Are there any clear outliers? Any obvious patterns that you can describe based on what you know about the data-generating process?
Code
# Line chartggplot(milk_df, aes(x = date, y = milk_price)) +geom_line(color ="blue") +labs(title ="Average Milk Prices Over Time",x ="Date",y ="Milk Price")
The line chart indicates a rising trend in the average milk prices over the years. Furthermore, there is a notable spike in the average milk prices in the more recent years.
Code
# Density plotggplot(milk_df, aes(x = milk_price)) +geom_density(fill ="skyblue", color ="blue") +labs(title ="Density Plot of Average Milk Prices",x ="Milk Price",y ="Density")
The distribution of the average price, as depicted by the density curve, exhibits a bimodal pattern with two distinct peaks. Additionally, there is a subtle right right skew, indicating that the mean value is greater than the median. This suggests a distribution with a tendency towards higher values and asymmetry towards the right side.
Code
# Histogramggplot(milk_df, aes(x = milk_price)) +geom_histogram(fill ="skyblue", color ="blue", bins =20) +labs(title ="Histogram of Average Milk Prices",x ="Milk Price",y ="Frequency")
Over the majority of the observed period, the milk prices have generally hovered around the $2.8 range, occasionally experiencing increases to approximately $3.2 and $3.3. The distribution of prices appears to be sightly skewed towards right, indicating that higher prices occur less frequently while the majority of instances are concentrated around the lower values.
Code
# Boxplotggplot(milk_df, aes(y = milk_price)) +geom_boxplot(fill ="skyblue", color ="blue") +labs(title ="Boxplot of Average Milk Prices",x ="",y ="Milk Price")
In the box plot, no outliers are observed, and a slight rightward skewness is apparent.
Summary Statistics
Code
# Calculate summary statisticssummary_stats <-summary(milk_df$milk_price)# Create a tablesummary_table <-data.frame(Metric =c("Number of Observations", "Mean", "Median", "Mode", "Standard Deviation", "Range"),Value =c(length(milk_df$milk_price), mean(milk_df$milk_price), median(milk_df$milk_price),table(milk_df$milk_price)[which.max(table(milk_df$milk_price))],sd(milk_df$milk_price), diff(range(milk_df$milk_price))))# Print the tableprint(summary_table)
Metric Value
1 Number of Observations 341.0000000
2 Mean 3.2004428
3 Median 3.1780000
4 Mode 4.0000000
5 Standard Deviation 0.4241547
6 Range 1.7590000
Outlier Detetction
Code
# Calculate Z-scoresz_scores <-scale(milk_df$milk_price)# Identify outliers based on Z-scoresoutliers <-abs(z_scores) >2# Adjust the threshold as needed# Display the identified outliersprint(milk_df[outliers, ])
based on z score these 10 data points are the outliers
Section 3
Visualize a moving average of the time series variable and describe any insights gained from this. Then, visualize the “remainder” series calculated by subtracting the moving average from the original time series. Do there appear to be other patterns in the data that are not captured by the moving average?
Assess whether your time series contains seasonality using a time series decomposition and visualizations discussed in class. Does the seasonality appear to be strong, weak, or absent entirely? Does this match your expectations? Include visualizations as necessary, as well as a text explanation of your findings.
Code
suppressWarnings({suppressMessages({# Function to calculate moving averagecalculate_moving_average <-function(data, window_size) { data %>%mutate(moving_avg = zoo::rollmean(milk_price, k = window_size, fill =NA, align ="right"))}# Function to calculate remainder seriescalculate_remainder <-function(data, window_size) { data %>%mutate(remainder = milk_price - zoo::rollmean(milk_price, k = window_size, fill =NA, align ="right"))}# Choose a window sizewindow_size <-24# Adjust as needed# Calculate moving average and remaindermilk_df <- milk_df %>%calculate_moving_average(window_size) %>%calculate_remainder(window_size)# Create a plot for moving averagemoving_avg_plot <-ggplot(milk_df, aes(x = date, y = moving_avg)) +geom_line(color ="blue") +labs(title =paste("Moving Average (Window Size:", window_size, ")"),x ="Date",y ="Milk Price") +theme_minimal()# Create a plot for remainder seriesremainder_plot <-ggplot(milk_df, aes(x = date, y = remainder)) +geom_line(color ="red") +labs(title =paste("Remainder Series (Original - Moving Average)"),x ="Date",y ="Remainder") +theme_minimal()# Display both plotsgrid.arrange(moving_avg_plot, remainder_plot, ncol =1)})})
The remainder plot represents the seasonality component that remains unaccounted for by the moving average. To capture this seasonality, it is necessary to conduct time series decomposition.
Code
# Create a time series objectts_data <-ts(milk_df$milk_price, frequency =12, start =c(year(milk_df$date[1]), month(milk_df$date[1]))) # Perform time series decompositiondecomposition <-stl(ts_data, s.window ="periodic")# Plot the decompositionplot(decomposition)
Code
# Extract componentstrend <- decomposition$time.series[, "trend"]seasonal <- decomposition$time.series[, "seasonal"]remainder <- decomposition$time.series[, "remainder"]# Create plots for each componentggplot() +geom_line(aes(x = milk_df$date, y = trend), color ="blue") +labs(title ="Trend Component",x ="Date",y ="Trend") +theme_minimal()
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
Code
ggplot() +geom_line(aes(x = milk_df$date, y = seasonal), color ="green") +labs(title ="Seasonal Component",x ="Date",y ="Seasonal") +theme_minimal()
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
Code
ggplot() +geom_line(aes(x = milk_df$date, y = remainder), color ="red") +labs(title ="Remainder Component",x ="Date",y ="Remainder")
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
Following the time series decomposition, the analysis reveals that the seasonality component is effectively captured. Moreover, the seasonality component exhibits strength with a yearly period and possesses an additive nature.
Section 4
Finally, conduct a 6 time period naive forecast of your time series. If it contains significant seasonality, be sure to account for this. If you feel a naive forecast with drift is a better candidate, implement this approach and justify your decision. Does the naive forecast seem to do a good job of representing the behavior of the data? Why or why not? Include visualizations of the forecast and the original time series as appropriate.
suppressWarnings({suppressMessages({drift_lm_data = t_data %>%filter(date ==ymd(c('1995-07-01','2023-10-01')))drift_lm =lm(data = drift_lm_data,milk_price~date)drift_lm_pred = t_data %>%mutate(pred =predict(drift_lm,newdata=t_data))t_data %>%model(Naive =NAIVE(milk_price~drift()) ) %>%forecast(h =6) %>%autoplot(milk_df, level =NULL,size=1, title ='Naive Forecast with Drift') +geom_vline(aes(xintercept =ymd("2023-11-01")), color ="red", linetype ="dashed") +geom_line(data=drift_lm_pred,aes(date,pred),color='blue',linetype='dashed')+theme_bw()+ylab('Avg Milk Pric') +ggtitle('Naive Forecast with Drift')})})
For the forecasting task, I conducted analyses using the Mean of the series, Naive, and Naive with drift methods. The Naive with drift approach exhibited a noteworthy forecasting outcome among the methods tested. However, it may not be the optimal approach for accurately representing the data’s behavior. Alternative methods such as SARIMA or ETS models could be more suitable, as they incorporate seasonality and trend components for forecasting. Currently, these components are not utilized, despite their presence in the data.