A time-series data is a series of data points or observations recorded at different or regular time intervals. In general, a time series is a sequence of data points taken at equally spaced time intervals. The frequency of recorded data points may be hourly, daily, weekly, monthly, quarterly, or annually.
Time-Series Forecasting is the process of using a statistical model to predict future values of a time-series based on past results.
A time series analysis encompasses statistical methods for analyzing time series data. These methods enable us to extract meaningful statistics, patterns, and other characteristics of the data. Time series are visualized with the help of line charts. So, time series analysis involves understanding inherent aspects of the time series data so that we can create meaningful and accurate forecasts.
Applications of time series are used in statistics, finance, or business applications. A very common example of time series data is the daily closing value of the stock index like NASDAQ or Dow Jones. Other common applications of time series are sales and demand forecasting, weather forecasting, econometrics, signal processing, pattern recognition, and earthquake prediction.
Trend: The trend shows a general direction of the time series data over a long period of time. A trend can be increasing (upward), decreasing (downward), or horizontal (stationary).
Seasonality: The seasonality component exhibits a trend that repeats with respect to timing, direction, and magnitude. Some examples include an increase in water consumption in summer due to hot weather conditions.
Cyclical Component: These are the trends with no set repetition over a particular period. A cycle refers to the period of ups and downs, booms and slumps in a time series, mostly observed in business cycles. These cycles do not exhibit a seasonal variation but generally occur over a period of 3 to 12 years depending on the nature of the time series.
Irregular Variation: These are fluctuations in the time series data that become evident when trend and cyclical variations are removed. These variations are unpredictable, erratic, and may or may not be random.
ETS Decomposition: ETS Decomposition is used to separate different components of a time series. The term ETS stands for Error, Trend, and Seasonality.
In this notebook, we conduct time series analysis of US Air Passengers over time.
As stated above, the time series analysis is the statistical analysis of the time series data. A time series data means that data is recorded at different time periods or intervals. The time series data may be of three types:-
1 Time series data - The observations of the values of a variable recorded at different points in time is called time series data.
2 Cross sectional data - It is the data of one or more variables recorded at the same point in time.
3 Pooled data- It is the combination of time series data and cross sectional data.
There are various terms and concepts in time series that we should know. These are as follows:-
1 Dependence- It refers to the association of two observations of the same variable at prior time periods.
2 Stationarity- It shows the mean value of the series that remains constant over the time period. If past effects accumulate and the values increase towards infinity then stationarity is not met.
3 Differencing- Differencing is used to make the series stationary and to control the auto-correlations. There may be some cases in time series analyses where we do not require differencing and over-differenced series can produce wrong estimates.
4 Specification - It may involve the testing of the linear or non-linear relationships of dependent variables by using time series models such as ARIMA models.
5 Exponential Smoothing - Exponential smoothing in time series analysis predicts the one next period value based on the past and current value. It involves averaging of data such that the non-systematic components of each individual case or observation cancel out each other. The exponential smoothing method is used to predict the short term prediction.
6 Curve fitting - Curve fitting regression in time series analysis is used when data is in a non-linear relationship.
7 ARIMA - ARIMA stands for Auto Regressive Integrated Moving Average.
install.packages(c("fpp3", "lubridate", "dplyr", "tidyr", "forecast", "tseries", "imputeTS", "Metrics"))
## package 'fpp3' successfully unpacked and MD5 sums checked
## package 'lubridate' successfully unpacked and MD5 sums checked
## package 'dplyr' successfully unpacked and MD5 sums checked
## package 'tidyr' successfully unpacked and MD5 sums checked
## package 'forecast' successfully unpacked and MD5 sums checked
## package 'tseries' successfully unpacked and MD5 sums checked
## package 'imputeTS' successfully unpacked and MD5 sums checked
## package 'Metrics' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ASUS\AppData\Local\Temp\Rtmp6VQ6Es\downloaded_packages
install.packages("seasonal")
## package 'seasonal' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ASUS\AppData\Local\Temp\Rtmp6VQ6Es\downloaded_packages
# Loading core time series and data wrangling libraries
library(fpp3) # Includes tsibble, lubridate, ggplot2, and others for time series
library(feasts) # For time series decomposition
library(urca) # For KPSS test
library(zoo) # For rolling mean and std. deviation
library(tsibble) # For working with time series data frames
library(readr) # For reading the csv file
library(lubridate) # For date-time handling
library(dplyr) # Data manipulation
library(tidyr) # Data tidying
library(forecast) # ARIMA and forecasting models
library(tseries) # Time series analysis functions
library(imputeTS) # Missing value handling
library(Metrics) # Performance evaluation metrics
library(ggplot2) # For good visuals
# Load the dataset
AirPassengers <- read.csv("~/AirPassengers.csv")
# Statistics of data:
head(AirPassengers)
## Month X.Passengers
## 1 1949-01 112
## 2 1949-02 118
## 3 1949-03 132
## 4 1949-04 129
## 5 1949-05 121
## 6 1949-06 135
summary(AirPassengers)
## Month X.Passengers
## Length:144 Min. :104.0
## Class :character 1st Qu.:180.0
## Mode :character Median :265.5
## Mean :280.3
## 3rd Qu.:360.5
## Max. :622.0
colnames(AirPassengers) <- c("Date", "Number of Passengers")
head(AirPassengers)
## Date Number of Passengers
## 1 1949-01 112
## 2 1949-02 118
## 3 1949-03 132
## 4 1949-04 129
## 5 1949-05 121
## 6 1949-06 135
AirPassengers_tsibble <- AirPassengers %>%
mutate(
Date = yearmonth(Date),
Year = year(Date),
Month = month(Date, label=TRUE)
) %>% # Since its a monthly data..
as_tsibble(index = Date)
AirPassengers_tsibble
## # A tsibble: 144 x 4 [1M]
## Date `Number of Passengers` Year Month
## <mth> <int> <dbl> <ord>
## 1 1949 Jan 112 1949 Jan
## 2 1949 Feb 118 1949 Feb
## 3 1949 Mar 132 1949 Mar
## 4 1949 Apr 129 1949 Apr
## 5 1949 May 121 1949 May
## 6 1949 Jun 135 1949 Jun
## 7 1949 Jul 148 1949 Jul
## 8 1949 Aug 148 1949 Aug
## 9 1949 Sep 136 1949 Sep
## 10 1949 Oct 119 1949 Oct
## # ℹ 134 more rows
sapply(AirPassengers_tsibble, function(x) sum(is.na(x)))
## Date Number of Passengers Year
## 0 0 0
## Month
## 0
AirPassengers_tsibble <- AirPassengers_tsibble %>%
mutate(
LogPassengers = log(`Number of Passengers`), # Log transformation
DiffPassengers = difference(LogPassengers), # First-order differencing
SeasonalDiffPassengers = difference(LogPassengers, lag = 12) # Seasonal differencing
)
tail(AirPassengers_tsibble)
## # A tsibble: 6 x 7 [1M]
## Date `Number of Passengers` Year Month LogPassengers DiffPassengers
## <mth> <int> <dbl> <ord> <dbl> <dbl>
## 1 1960 Jul 622 1960 Jul 6.43 0.151
## 2 1960 Aug 606 1960 Aug 6.41 -0.0261
## 3 1960 Sep 508 1960 Sep 6.23 -0.176
## 4 1960 Oct 461 1960 Oct 6.13 -0.0971
## 5 1960 Nov 390 1960 Nov 5.97 -0.167
## 6 1960 Dec 432 1960 Dec 6.07 0.102
## # ℹ 1 more variable: SeasonalDiffPassengers <dbl>
AirPassengers_tsibble %>%
pivot_longer(cols = c(`Number of Passengers`, LogPassengers, DiffPassengers, SeasonalDiffPassengers),
names_to = "Transformation", values_to = "Value") %>%
ggplot(aes(x = Date, y = Value)) +
geom_line() +
facet_wrap(~ Transformation, scales = "free_y") +
labs(title = "Transformations of US Air Passengers Time Series")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
The transformations provide several insights into the US Air Passengers time series data, particularly concerning trends, seasonality, and variance stabilization:
Number of Passengers):
LogPassengers):
DiffPassengers):
SeasonalDiffPassengers):
These insights guide us in preparing the data for modeling, as stationary data typically results in more accurate forecasts and better model performance.
yearly_summary <- AirPassengers_tsibble %>%
index_by(Year) %>% # Aggregates data by year
summarise(
Mean = mean(`Number of Passengers`, na.rm = TRUE),
Variance = var(`Number of Passengers`, na.rm = TRUE),
Standard_Deviation = sd(`Number of Passengers`, na.rm = TRUE),
Min = min(`Number of Passengers`, na.rm = TRUE),
Max = max(`Number of Passengers`, na.rm = TRUE),
Sum = sum(`Number of Passengers`, na.rm = TRUE)
)
# Display the yearly summary statistics
yearly_summary
## # A tsibble: 12 x 7 [1Y]
## Year Mean Variance Standard_Deviation Min Max Sum
## <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 1949 127. 188. 13.7 104 148 1520
## 2 1950 140. 364. 19.1 114 170 1676
## 3 1951 170. 340. 18.4 145 199 2042
## 4 1952 197 527. 23.0 171 242 2364
## 5 1953 225 810. 28.5 180 272 2700
## 6 1954 239. 1220. 34.9 188 302 2867
## 7 1955 284 1776. 42.1 233 364 3408
## 8 1956 328. 2291. 47.9 271 413 3939
## 9 1957 368. 3351. 57.9 301 467 4421
## 10 1958 381 4164. 64.5 310 505 4572
## 11 1959 428. 4876. 69.8 342 559 5140
## 12 1960 476. 6043. 77.7 390 622 5714
# Create a box plot for the number of passengers by year
ggplot(AirPassengers_tsibble, aes(x = factor(Year), y = `Number of Passengers`)) +
geom_boxplot(show.legend = FALSE, outlier.shape = NA) +
stat_summary(fun = "mean", geom = "point", shape = 20, color = "red", size = 2) +
labs(title = "Year-wise Box Plot of Passenger Count", x = "Year", y = "Number of Passengers") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_discrete(limits = as.character(1949:1960)) # Set x-axis labels from 1949 to 1960
However, It is not mandatory that all time series must have a trend and/or seasonality. A time series may not have a distinct trend but have a seasonality and vice-versa.
plot_df <- function(data, x, y, title = "", xlabel = "Date", ylabel = "Number of Passengers") {
ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
geom_line(color = "blue") +
labs(title = title, x = xlabel, y = ylabel) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
}
# Use the function to plot the data
plot_df(AirPassengers_tsibble, x = Date, y = `Number of Passengers`, title = "Trend and Seasonality")
Another important thing to consider is the cyclic behaviour. It happens when the rise and fall pattern in the series does not happen in fixed calendar-based intervals. We should not confuse ‘cyclic’ effect with ‘seasonal’ effect.
If the patterns are not of fixed calendar based frequencies, then it is cyclic. Because, unlike the seasonality, cyclic effects are typically influenced by the business and other socio-economic factors.
Value = Base Level + Trend + Seasonality + Error
Value = Base Level x Trend x Seasonality x Error
Decomposition of a time series can be performed by considering the series as an additive or multiplicative combination of the base level, trend, seasonal index and the residual term.
The seasonal_decompose in statsmodels implements this conveniently.
# Additive Decomposition using STL
additive_decomposition <- AirPassengers_tsibble %>%
model(STL(`Number of Passengers` ~ season(window = "periodic"))) %>%
components()
# Multiplicative Decomposition using X11
multiplicative_decomposition <- AirPassengers_tsibble %>%
model(X_13ARIMA_SEATS(`Number of Passengers` ~ x11())) %>%
components()
# Plot Additive Decomposition
autoplot(additive_decomposition) +
labs(title = "Additive Decomposition", x = "Date", y = "Number of Passengers") +
theme_minimal()
# Plot Multiplicative Decomposition
autoplot(multiplicative_decomposition) +
labs(title = "Multiplicative Decomposition", x = "Date", y = "Number of Passengers") +
theme_minimal()
decomposed <- decompose(ts(AirPassengers_tsibble$`Number of Passengers`, frequency = 12), type = "multiplicative")
plot(decomposed)
- If we look at the residuals of the additive decomposition closely, it
has some pattern left over.
Now, we wil discuss Stationary and Non-Stationary Time Series. Stationarity is a property of a time series. A stationary series is one where the values of the series is not a function of time. So, the values are independent of time.
Hence the statistical properties of the series like mean, variance and autocorrelation are constant over time. Autocorrelation of the series is nothing but the correlation of the series with its previous values.
A stationary time series is independent of seasonal effects as well.
Now, we will plot some examples of stationary and non-stationary time series for clarity.
Image source: Machine Learning Plus
If Y_t is the value at time t, then the first difference of Y = Yt – Yt-1. In simpler terms, differencing the series is nothing but subtracting the next value by the current value.
If the first difference doesn’t make a series stationary, we can go for the second differencing and so on.
For example, consider the following series: [1, 5, 2, 12, 20]
First differencing gives: [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]
Second differencing gives: [-3-4, -10-3, 8-10] = [-7, -13, -2]
There are reasons why we want to convert a non-stationary series into a stationary one. These are given below:
Forecasting a stationary series is relatively easy and the forecasts are more reliable.
An important reason is, autoregressive forecasting models are essentially linear regression models that utilize the lag(s) of the series itself as predictors.
We know that linear regression works best if the predictors (X variables) are not correlated against each other. So, stationarizing the series solves this problem since it removes any persistent autocorrelation, thereby making the predictors(lags of the series) in the forecasting models nearly independent.
The stationarity of a series can be checked by looking at the plot of the series.
Another method is to split the series into 2 or more contiguous parts and computing the summary statistics like the mean, variance and the autocorrelation. If the stats are quite different, then the series is not likely to be stationary.
There are several quantitative methods we can use to determine if a given series is stationary or not. This can be done using statistical tests called Unit Root Tests. This test checks if a time series is non-stationary and possess a unit root.
There are multiple implementations of Unit Root tests but we will use:
The KPSS test, on the other hand, is used to test for trend stationarity. The null hypothesis and the P-Value interpretation is just the opposite of ADH test.
Interested readers can learn more about the KPSS test from the below links:
https://en.wikipedia.org/wiki/KPSS_test
https://www.machinelearningplus.com/time-series/kpss-test-for-stationarity/
https://www.statisticshowto.com/kpss-test/
https://nwfsc-timeseries.github.io/atsa-labs/sec-boxjenkins-kpss.html
# Defining the stationarity test function:
test_stationarity <- function(timeseries, window = 52) {
# Calculate rolling mean and rolling standard deviation
rolmean <- rollmean(timeseries, k = window, fill = NA)
rolstd <- rollapply(timeseries, width = window, FUN = sd, fill = NA, align = "center")
# Plot original data, rolling mean, and rolling standard deviation
p <- ggplot() +
geom_line(aes(x = index(timeseries), y = coredata(timeseries)), color = "blue", linewidth = 0.8, linetype = "solid") +
geom_line(aes(x = index(timeseries), y = rolmean), color = "red", linewidth = 0.8, linetype = "dashed") +
geom_line(aes(x = index(timeseries), y = rolstd), color = "black", linewidth = 0.8, linetype = "dotted") +
labs(title = "Rolling Mean & Standard Deviation",
x = "Date",
y = "Number of Passengers") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom") +
labs(color = "Legend") +
scale_color_manual(values = c("blue", "red", "black"),
labels = c("Original", "Rolling Mean", "Rolling Std"))
# Print the plot:
print(p)
# Perform the KPSS Test:
print("Results of KPSS Test:")
kptest <- ur.kpss(timeseries, type = "tau")
summary(kptest)
}
test_stationarity(AirPassengers_tsibble$`Number of Passengers`)
## Warning: Removed 51 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 51 rows containing missing values or values outside the scale range
## (`geom_line()`).
## [1] "Results of KPSS Test:"
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: tau with 4 lags.
##
## Value of test-statistic is: 0.0961
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.119 0.146 0.176 0.216
By Observing the plot in the result of the KPSS Test, we infer the test statistic is less than critical value so we fail to reject the null hypothesis
This shows that the time series is stationary.
Like a stationary series, the white noise is also not a function of time. So, its mean and variance does not change over time. But the difference is that, the white noise is completely random with a mean of 0. In white noise there is no pattern.
Mathematically, a sequence of completely random numbers with mean zero is a white noise.
# Generate 1000 random numbers from a standard normal distribution
rand_numbers <- rnorm(1000)
# Convert to a data frame for plotting
rand_df <- data.frame(Index = 1:1000, Value = rand_numbers)
# Plot the random numbers to represent white noise
ggplot(rand_df, aes(x = Index, y = Value)) +
geom_line(color = "blue") +
labs(title = "Random White Noise", x = "Index", y = "Value") +
theme_minimal()
# Splitting the data:
train <- head(AirPassengers_tsibble$`Number of Passengers`, 120)
test <- tail(AirPassengers_tsibble$`Number of Passengers`, 24)
# Smoothing Parameter:
alphas <- seq(0.01, 1, by=0.10)
# SES optimizer function to find the best alpha based on MAE
ses_optimizer <- function(train, test, alphas, step = 24) {
best_alpha <- NULL
best_mae <- Inf
for (alpha in alphas) {
# Fit the SES model with the current alpha
ses_model <- ses(train, alpha = alpha, h = step)
y_pred <- ses_model$mean
# Calculate MAE between the forecast and the test set
mae <- mae(test, y_pred)
# Update best_alpha and best_mae if current MAE is lower
if (mae < best_mae) {
best_alpha <- alpha
best_mae <- mae
}
cat("Alpha:", round(alpha, 2), "MAE:", round(mae, 4), "\n")
}
cat("Best Alpha is:", round(best_alpha, 2), "with Best MAE:", round(best_mae, 4), "\n")
return(list(best_alpha = best_alpha, best_mae = best_mae))
}
# Find the best alpha
opt_results <- ses_optimizer(train, test, alphas)
## Alpha: 0.01 MAE: 190.126
## Alpha: 0.11 MAE: 82.528
## Alpha: 0.21 MAE: 82.8979
## Alpha: 0.31 MAE: 89.8377
## Alpha: 0.41 MAE: 99.0585
## Alpha: 0.51 MAE: 107.5558
## Alpha: 0.61 MAE: 113.7514
## Alpha: 0.71 MAE: 117.2224
## Alpha: 0.81 MAE: 118.1776
## Alpha: 0.91 MAE: 117.2438
## Best Alpha is: 0.11 with Best MAE: 82.528
best_ses_alpha <- opt_results$best_alpha
# Fit the final SES model with the best alpha
final_ses_model <- ses(train, alpha = best_ses_alpha, h = 24)
y_pred_ses <- final_ses_model$mean
forecast_errors <- test - y_pred_ses
forecast_se <- sd(forecast_errors)
# Prediction intervals
lower_bounds <- y_pred_ses - 1.96 * forecast_se
upper_bounds <- y_pred_ses + 1.96 * forecast_se
# Plotting function
plot_ses <- function(train, test, y_pred, lower_bounds, upper_bounds, title = "Simple Exponential Smoothing Forecast") {
# Create a combined data frame for plotting
df <- data.frame(
Date = c(1:length(train), (length(train) + 1):(length(train) + length(test))),
Passengers = c(train, test),
DataType = c(rep("Train", length(train)), rep("Test", length(test)))
)
# Predicted values data frame
df_pred <- data.frame(
Date = (length(train) + 1):(length(train) + length(y_pred)),
Passengers = y_pred,
Lower = lower_bounds,
Upper = upper_bounds,
DataType = rep("Prediction", length(y_pred))
)
# Plot the training, test, and predicted data
p <- ggplot() +
geom_line(data = df, aes(x = Date, y = Passengers, color = DataType), size = 1) +
geom_line(data = df_pred, aes(x = Date, y = Passengers, color = DataType), size = 1) +
geom_ribbon(data = df_pred, aes(x = Date, ymin = Lower, ymax = Upper, fill = DataType), alpha = 0.3) +
labs(title = title, x = "Time", y = "Number of Passengers") +
scale_color_manual(values = c("Train" = "blue", "Test" = "green", "Prediction" = "red")) +
theme_minimal() +
theme(legend.position = "bottom")
print(p)
}
# Plot the SES forecast
plot_ses(train, test, y_pred_ses, lower_bounds, upper_bounds, title = "Simple Exponential Smoothing Forecast")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Smoothing parameters:
alphas <- seq(0.01, 1, by = 0.10)
betas <- seq(0.01, 1, by = 0.10)
# Double Exponential Smoothing optimizer function with error handling
des_optimizer <- function(train, test, alphas, betas, step = 24) {
best_alpha <- NULL
best_beta <- NULL
best_mae <- Inf
for (alpha in alphas) {
for (beta in betas) {
# Try fitting the model and handle errors
tryCatch({
# Fit the Double Exponential Smoothing model with current alpha and beta
des_model <- holt(train, alpha = alpha, beta = beta, h = step, damped = FALSE, trend = "additive")
y_pred <- des_model$mean
# Calculate MAE between forecast and test set
mae <- mae(test, y_pred)
# Update best_alpha, best_beta, and best_mae if current MAE is lower
if (mae < best_mae) {
best_alpha <- alpha
best_beta <- beta
best_mae <- mae
}
cat("Alpha:", round(alpha, 2), "Beta:", round(beta, 2), "MAE:", round(mae, 4), "\n")
}, error = function(e) {
# Skip invalid models and print a message
cat("Alpha:", round(alpha, 2), "Beta:", round(beta, 2), "could not fit a valid model.\n")
})
}
}
cat("Best Alpha:", round(best_alpha, 2), "Best Beta:", round(best_beta, 2), "Best MAE:", round(best_mae, 4), "\n")
return(list(best_alpha = best_alpha, best_beta = best_beta, best_mae = best_mae))
}
# Find the best alpha and beta
opt_results <- des_optimizer(train, test, alphas, betas)
## Alpha: 0.01 Beta: 0.01 MAE: 58.3642
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.11 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.21 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.31 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.41 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.51 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.61 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.01 Beta: 0.91 could not fit a valid model.
## Alpha: 0.11 Beta: 0.01 MAE: 56.2696
## Alpha: 0.11 Beta: 0.11 MAE: 68.4861
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.21 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.31 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.41 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.51 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.61 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.11 Beta: 0.91 could not fit a valid model.
## Alpha: 0.21 Beta: 0.01 MAE: 60.0355
## Alpha: 0.21 Beta: 0.11 MAE: 185.5454
## Alpha: 0.21 Beta: 0.21 MAE: 865.9462
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.31 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.41 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.51 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.61 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.21 Beta: 0.91 could not fit a valid model.
## Alpha: 0.31 Beta: 0.01 MAE: 69.0191
## Alpha: 0.31 Beta: 0.11 MAE: 250.9023
## Alpha: 0.31 Beta: 0.21 MAE: 724.9144
## Alpha: 0.31 Beta: 0.31 MAE: 1054.183
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.31 Beta: 0.41 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.31 Beta: 0.51 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.31 Beta: 0.61 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.31 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.31 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.31 Beta: 0.91 could not fit a valid model.
## Alpha: 0.41 Beta: 0.01 MAE: 78.7054
## Alpha: 0.41 Beta: 0.11 MAE: 272.5327
## Alpha: 0.41 Beta: 0.21 MAE: 607.6325
## Alpha: 0.41 Beta: 0.31 MAE: 840.2858
## Alpha: 0.41 Beta: 0.41 MAE: 842.9667
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.41 Beta: 0.51 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.41 Beta: 0.61 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.41 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.41 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.41 Beta: 0.91 could not fit a valid model.
## Alpha: 0.51 Beta: 0.01 MAE: 87.9837
## Alpha: 0.51 Beta: 0.11 MAE: 269.3238
## Alpha: 0.51 Beta: 0.21 MAE: 513.0025
## Alpha: 0.51 Beta: 0.31 MAE: 678.7275
## Alpha: 0.51 Beta: 0.41 MAE: 683.8468
## Alpha: 0.51 Beta: 0.51 MAE: 568.958
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.51 Beta: 0.61 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.51 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.51 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.51 Beta: 0.91 could not fit a valid model.
## Alpha: 0.61 Beta: 0.01 MAE: 94.4417
## Alpha: 0.61 Beta: 0.11 MAE: 253.8162
## Alpha: 0.61 Beta: 0.21 MAE: 434.8159
## Alpha: 0.61 Beta: 0.31 MAE: 552.246
## Alpha: 0.61 Beta: 0.41 MAE: 551.2462
## Alpha: 0.61 Beta: 0.51 MAE: 448.5929
## Alpha: 0.61 Beta: 0.61 MAE: 274.5915
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.61 Beta: 0.71 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.61 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.61 Beta: 0.91 could not fit a valid model.
## Alpha: 0.71 Beta: 0.01 MAE: 97.7789
## Alpha: 0.71 Beta: 0.11 MAE: 233.3323
## Alpha: 0.71 Beta: 0.21 MAE: 369.8343
## Alpha: 0.71 Beta: 0.31 MAE: 453.105
## Alpha: 0.71 Beta: 0.41 MAE: 447.3636
## Alpha: 0.71 Beta: 0.51 MAE: 361.3254
## Alpha: 0.71 Beta: 0.61 MAE: 220.8079
## Alpha: 0.71 Beta: 0.71 MAE: 69.0392
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.71 Beta: 0.81 could not fit a valid model.
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.71 Beta: 0.91 could not fit a valid model.
## Alpha: 0.81 Beta: 0.01 MAE: 98.2514
## Alpha: 0.81 Beta: 0.11 MAE: 211.7285
## Alpha: 0.81 Beta: 0.21 MAE: 316.1826
## Alpha: 0.81 Beta: 0.31 MAE: 375.9728
## Alpha: 0.81 Beta: 0.41 MAE: 368.3713
## Alpha: 0.81 Beta: 0.51 MAE: 298.9904
## Alpha: 0.81 Beta: 0.61 MAE: 187.98
## Alpha: 0.81 Beta: 0.71 MAE: 68.8423
## Alpha: 0.81 Beta: 0.81 MAE: 102.2852
## [1] "Model: ETS(A,A,N)"
## Alpha: 0.81 Beta: 0.91 could not fit a valid model.
## Alpha: 0.91 Beta: 0.01 MAE: 96.6515
## Alpha: 0.91 Beta: 0.11 MAE: 190.9407
## Alpha: 0.91 Beta: 0.21 MAE: 272.1183
## Alpha: 0.91 Beta: 0.31 MAE: 315.9012
## Alpha: 0.91 Beta: 0.41 MAE: 308.1503
## Alpha: 0.91 Beta: 0.51 MAE: 252.7779
## Alpha: 0.91 Beta: 0.61 MAE: 164.5791
## Alpha: 0.91 Beta: 0.71 MAE: 68.7598
## Alpha: 0.91 Beta: 0.81 MAE: 83.6097
## Alpha: 0.91 Beta: 0.91 MAE: 159.7716
## Best Alpha: 0.11 Best Beta: 0.01 Best MAE: 56.2696
best_des_alpha <- opt_results$best_alpha
best_des_beta <- opt_results$best_beta
# Fit the final DES model with the best alpha and beta
final_des_model <- holt(train, alpha = best_des_alpha, beta = best_des_beta, h = 24, damped = FALSE, trend = "additive")
y_pred_des <- final_des_model$mean
# After fitting the final DES model
forecast_errors_des <- test - y_pred_des
forecast_se_des <- sd(forecast_errors_des)
# Prediction intervals
lower_bounds_des <- y_pred_des - 1.96 * forecast_se_des
upper_bounds_des <- y_pred_des + 1.96 * forecast_se_des
# Plotting function for Double Exponential Smoothing
plot_des <- function(train, test, y_pred, lower_bounds, upper_bounds, title = "Double Exponential Smoothing Forecast") {
# Create a combined data frame for plotting
df <- data.frame(
Date = c(1:length(train), (length(train) + 1):(length(train) + length(test))),
Passengers = c(train, test),
DataType = c(rep("Train", length(train)), rep("Test", length(test)))
)
# Predicted values data frame
df_pred <- data.frame(
Date = (length(train) + 1):(length(train) + length(y_pred)),
Passengers = y_pred,
Lower = lower_bounds,
Upper = upper_bounds,
DataType = rep("Prediction", length(y_pred))
)
# Plot the training, test, and predicted data
p <- ggplot() +
geom_line(data = df, aes(x = Date, y = Passengers, color = DataType), size = 1) +
geom_line(data = df_pred, aes(x = Date, y = Passengers, color = DataType), size = 1) +
geom_ribbon(data = df_pred, aes(x = Date, ymin = Lower, ymax = Upper, fill = DataType), alpha = 0.3) +
labs(title = title, x = "Time", y = "Number of Passengers") +
scale_color_manual(values = c("Train" = "blue", "Test" = "green", "Prediction" = "red")) +
theme_minimal() +
theme(legend.position = "bottom")
print(p)
}
# Plot the DES forecast
plot_des(train, test, y_pred_des, lower_bounds_des, upper_bounds_des, title = "DES MODEL")
# Smoothing Parameters:
alpha <- seq(0.01, 1, by = 0.10)
beta <- seq(0.01, 1, by = 0.10)
gamma <- seq(0.01, 1, by = 0.10)
# Create combinations of alpha, beta, gamma
abg <- expand.grid(alpha = alpha, beta = beta, gamma = gamma)
# Monthly data: two full cycles
train_ts <- ts(train, frequency = 12)
# Triple Exponential Smoothing optimizer function using HoltWinters
tes_optimizer <- function(train_ts, abg, step = 24, ci_level = c(80, 95)) {
best_model <- NULL
best_mae <- Inf
for(i in 1:nrow(abg)) {
comb <- abg[i, ]
tes_model <- tryCatch({HoltWinters(train_ts,
alpha = comb$alpha,
beta = comb$beta,
gamma = comb$gamma,
seasonal = "additive")
}, error = function(e) return(NULL))
if (!is.null(tes_model)) {
tes_forecast <- forecast(tes_model, h = step, level = ci_level)
mae <- mean(abs(tes_forecast$mean - test)) # Ensure 'test' is defined or adapt as needed
if (mae < best_mae) {
best_model <- tes_model
best_mae <- mae
best_alpha <- comb$alpha
best_beta <- comb$beta
best_gamma <- comb$gamma
}
}
}
cat(sprintf("Best Alpha: %.2f, Best Beta: %.2f, Best Gamma: %.2f, Best MAE: %.4f\n",
best_alpha, best_beta, best_gamma, best_mae))
return(list(best_model = best_model, best_mae = best_mae))
}
results <- tes_optimizer(train_ts, abg)
## Best Alpha: 0.21, Best Beta: 0.41, Best Gamma: 0.61, Best MAE: 9.8513
# Final model using the best parameters
final_model <- results$best_model
final_forecast <- forecast(final_model, h = 24)
# Plotting function for Triple Exponential Smoothing
plot_tes <- function(train, test, forecast_obj, title = "Triple Exponential Smoothing Forecast") {
# Extract prediction and confidence intervals
y_pred <- forecast_obj$mean
lower <- forecast_obj$lower[,2]
upper <- forecast_obj$upper[,2]
# Lengths of each data segment
train_len <- length(train)
test_len <- length(test)
pred_len <- length(y_pred)
train_dates <- 1:train_len
forecast_dates <- (train_len + 1):(train_len + pred_len)
# Create a combined data frame for plotting
# Data frames for plotting
df_train <- data.frame(Date = train_dates, Passengers = train, DataType = "Train")
df_test <- data.frame(Date = forecast_dates, Passengers = test, DataType = "Test")
df_pred <- data.frame(Date = forecast_dates, Passengers = y_pred, DataType = "Prediction")
df_conf_int <- data.frame(Date = forecast_dates, Lower = lower, Upper = upper)
# Combine data frames
df_all <- rbind(df_train, df_test, df_pred)
# Plot the training, test, and predicted data
ggplot(df_all, aes(x = Date, y = Passengers, color = DataType)) +
geom_line() +
geom_ribbon(data = df_conf_int, aes(ymin = Lower, ymax = Upper, x = Date), fill = "orange", alpha = 0.3, inherit.aes = FALSE) +
labs(title = title, x = "Time", y = "Number of Passengers") +
scale_color_manual(values = c("Train" = "blue", "Test" = "green", "Prediction" = "red")) +
theme_minimal() +
theme(legend.position = "bottom")
}
plot_tes(train, test, final_forecast, title = "TES Model Forecast with 95% confidence interval")
auto_fit <- auto.arima(train, seasonal=TRUE, stepwise=FALSE, approximation=FALSE, trace=TRUE)
##
## ARIMA(0,1,0) : 1138.843
## ARIMA(0,1,0) with drift : 1140.395
## ARIMA(0,1,1) : 1127.123
## ARIMA(0,1,1) with drift : 1128.877
## ARIMA(0,1,2) : 1127.182
## ARIMA(0,1,2) with drift : 1128.701
## ARIMA(0,1,3) : 1125.796
## ARIMA(0,1,3) with drift : Inf
## ARIMA(0,1,4) : 1108.02
## ARIMA(0,1,4) with drift : Inf
## ARIMA(0,1,5) : 1106.023
## ARIMA(0,1,5) with drift : Inf
## ARIMA(1,1,0) : 1130.76
## ARIMA(1,1,0) with drift : 1132.545
## ARIMA(1,1,1) : 1125.638
## ARIMA(1,1,1) with drift : 1127.299
## ARIMA(1,1,2) : 1118.45
## ARIMA(1,1,2) with drift : Inf
## ARIMA(1,1,3) : 1120.32
## ARIMA(1,1,3) with drift : Inf
## ARIMA(1,1,4) : 1110.116
## ARIMA(1,1,4) with drift : Inf
## ARIMA(2,1,0) : 1127.053
## ARIMA(2,1,0) with drift : 1128.688
## ARIMA(2,1,1) : 1113.397
## ARIMA(2,1,1) with drift : Inf
## ARIMA(2,1,2) : Inf
## ARIMA(2,1,2) with drift : 1087.779
## ARIMA(2,1,3) : 1119.766
## ARIMA(2,1,3) with drift : Inf
## ARIMA(3,1,0) : 1126.612
## ARIMA(3,1,0) with drift : 1128.075
## ARIMA(3,1,1) : 1114.967
## ARIMA(3,1,1) with drift : Inf
## ARIMA(3,1,2) : 1113.326
## ARIMA(3,1,2) with drift : Inf
## ARIMA(4,1,0) : 1119.263
## ARIMA(4,1,0) with drift : 1119.784
## ARIMA(4,1,1) : 1118.672
## ARIMA(4,1,1) with drift : 1119.764
## ARIMA(5,1,0) : 1121.282
## ARIMA(5,1,0) with drift : 1121.966
##
##
##
## Best model: ARIMA(2,1,2) with drift
# the selected model
summary(auto_fit)
## Series: train
## ARIMA(2,1,2) with drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 drift
## 1.6431 -0.9066 -1.8743 0.9688 2.1761
## s.e. 0.0371 0.0361 0.0572 0.0595 0.7107
##
## sigma^2 = 488.2: log likelihood = -537.51
## AIC=1087.03 AICc=1087.78 BIC=1103.7
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.3246097 21.53481 17.21691 -0.5203958 7.005998 0.776948
## ACF1
## Training set 0.08174572
# Forecasting
forecasted_values <- forecast(auto_fit, h=24)
# Check diagnostics
checkresiduals(auto_fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2) with drift
## Q* = 47.412, df = 6, p-value = 1.548e-08
##
## Model df: 4. Total lags used: 10
# Plotting function for SARIMA
plot_sarima <- function(train, test, best_fit, h = 24, title = "SARIMA Forecast") {
# Forecast future values
forecasted_values <- forecast(best_fit, h = h)
# Lengths of each data segment
train_len <- length(train)
test_len <- length(test)
forecast_dates <- (train_len + 1):(train_len + h)
# Create a combined data frame for plotting
df_train <- data.frame(
Date = 1:train_len,
Passengers = train,
DataType = "Train"
)
df_test <- data.frame(
Date = (train_len + 1):(train_len + test_len),
Passengers = test,
DataType = "Test"
)
df_pred <- data.frame(
Date = (train_len + 1):(train_len + length(forecasted_values$mean)),
Passengers = forecasted_values$mean,
DataType = "Prediction"
)
# Data frames for confidence intervals
df_conf_int <- data.frame(
Date = forecast_dates,
Lower = forecasted_values$lower[, "95%"],
Upper = forecasted_values$upper[, "95%"]
)
# Combine data frames
df_all <- rbind(df_train, df_test, df_pred)
# Plot the training, test, and predicted data
p <- ggplot(df_all, aes(x = Date, y = Passengers, color = DataType)) +
geom_line() +
geom_ribbon(data = df_conf_int, aes(ymin = Lower, ymax = Upper, x = Date), fill = "orange", alpha = 0.3, inherit.aes = FALSE) +
labs(title = title, x = "Time", y = "Number of Passengers") +
scale_color_manual(values = c("Train" = "blue", "Test" = "green", "Prediction" = "red")) +
theme_minimal() +
theme(legend.position = "bottom")
print(p)
}
plot_sarima(train, test, best_fit = auto_fit)
In the evaluation of various time series forecasting models—Simple Exponential Smoothing (SES), Double Exponential Smoothing (DES), Triple Exponential Smoothing (TES), and Seasonal Autoregressive Integrated Moving Average (SARIMA)—Triple Exponential Smoothing emerged as the most effective model. The assessment was based on the accuracy of forecasting future data points in a given time series, where TES demonstrated superior performance compared to its counterparts.
Incorporation of Seasonality: Unlike SES and DES, TES includes a seasonal component, which is crucial for datasets exhibiting regular seasonal variations. This allows TES to adjust its forecasts to align with periodic fluctuations, making it more accurate for data with inherent seasonality.
Level, Trend, and Seasonality Modeling: TES extends beyond DES by not only accounting for the level and trend components of the series but also explicitly modeling the seasonal variations. This comprehensive approach provides a more nuanced understanding and adjustment for changes in the data, enhancing predictive accuracy.
Flexibility in Parameters: TES allows for the adjustment of parameters related to level, trend, and seasonality. This flexibility enables the model to be finely tuned to the specific characteristics of the data, such as smoothing parameters and the type of seasonal patterns (additive or multiplicative), thus optimizing performance.
Robustness to Data Irregularities: The additional parameters and components in TES make it robust against anomalies and abrupt changes in the dataset, unlike simpler models like SES and DES, which might either underfit or overfit in the presence of noise and outliers.
In conclusion, Triple Exponential Smoothing’s superior performance can be attributed to its comprehensive approach to modeling time series data, encompassing level, trend, and seasonality. This holistic model structure allows TES to deliver more accurate forecasts, particularly in data environments where seasonal patterns play a significant role. The adaptability and robustness of TES make it an ideal choice for complex forecasting tasks where simpler models might falter.
To enhance the effectiveness of the Triple Exponential Smoothing (TES) model or the overall forecasting approach, consider implementing the following strategies:
Model Ensembling: Combine the strengths of TES with other forecasting models, such as SARIMA or machine learning methods like Random Forests or Gradient Boosting Machines. Ensembling can improve forecast accuracy by reducing variance and bias.
Incorporating Exogenous Variables: If external factors significantly impact the time series, incorporating these as exogenous variables could enhance the model’s accuracy. For example, adding variables related to economic indicators, weather conditions, or special events might yield better forecasts.
Error Analysis: Perform a detailed analysis of forecast errors to pinpoint where the model excels or fails. This insight can guide modifications to the model or its parameters, addressing any discovered biases or systematic errors.
Increasing Data Granularity: If possible, using data with higher frequency (e.g., switching from monthly to weekly data) could provide more detailed insights into seasonal patterns, potentially improving the model’s predictive accuracy.
Model Updating: Regular updates to the model with new data ensure it remains relevant and accurately reflects any shifts in trends or seasonality, crucial for maintaining forecasting accuracy over time.
Robustness Checks: Conduct stress tests and robustness checks by simulating various scenarios to evaluate how the model performs under different conditions. These tests can reveal weaknesses in the model’s assumptions and improve its reliability.
Advanced Time Series Techniques: Investigate advanced time series methodologies such as state space models or Bayesian approaches. These techniques may offer superior flexibility and robustness, particularly in handling complex or non-linear time series data.
By implementing these suggestions, the forecasting process can be significantly enhanced, leading to more reliable and precise predictions.