library(forecast)
library(fpp2)
library(ggplot2)
library(readxl)
library(gridExtra)

3.1 For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

data("usnetelec")
data("usgdp")
data("mcopper")
data("enplanements")

# Check for missing values
if (anyNA(usnetelec) || anyNA(usgdp) || anyNA(mcopper) || anyNA(enplanements)) {
  stop("Error: One or more datasets contain missing values. Handle them before transformation.")
}

# Box-Cox requires strictly positive values; shift a series upward only if it violates this
shift_positive <- function(x) if (any(x <= 0)) x - min(x) + 1 else x
usnetelec <- shift_positive(usnetelec)
usgdp <- shift_positive(usgdp)
mcopper <- shift_positive(mcopper)
enplanements <- shift_positive(enplanements)

# Create a function to apply a Box-Cox transformation and plot the result
plot_boxcox <- function(data, name) {
  lambda <- BoxCox.lambda(data)
  transformed_data <- BoxCox(data, lambda)

  # as.numeric() strips the ts class so ggplot can pick a continuous scale
  df <- data.frame(Time = as.numeric(time(data)), Value = as.numeric(transformed_data))

  ggplot(df, aes(x = Time, y = Value)) +
    geom_line(linewidth = 1) +  # linewidth replaces the deprecated size aesthetic
    ggtitle(paste("Box-Cox Transformation:", name, "\nLambda =", round(lambda, 2))) +
    theme_minimal() +
    xlab("Time") +
    ylab("Transformed Value")
}

# Generate plots for each dataset
p1 <- plot_boxcox(usnetelec, "usnetelec")
p2 <- plot_boxcox(usgdp, "usgdp")
p3 <- plot_boxcox(mcopper, "mcopper")
p4 <- plot_boxcox(enplanements, "enplanements")

# Display plots
grid.arrange(p1, p2, p3, p4, ncol = 2)

### Observations

1. usnetelec (λ = 0.52)

  • The original data has an upward trend and increasing variance over time.
  • After applying the Box-Cox transformation, the variance looks more stable, though there’s still some upward movement.

2. usgdp (λ = 0.37)

  • The series shows exponential growth, meaning variance increases as values get larger.
  • The transformation helps reduce this effect, making the trend a bit more linear and the variance more even.

3. mcopper (λ = 0.19)

  • There’s a lot of fluctuation in the original data, and variance seems to increase over time.
  • The transformation smooths things out a bit, but some variability remains.

4. enplanements (λ = -0.23)

  • This dataset has strong seasonality and a clear upward trend, with variance increasing over time.
  • The transformation helps stabilize the variance, but there are still noticeable fluctuations.
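
For reference, these λ values plug into the classic one-parameter Box-Cox definition below. This is a minimal sketch: forecast::BoxCox may use the Bickel-Doksum variant for series with negative values, so the sanity check assumes strictly positive data.

# Classic Box-Cox: log(x) when lambda = 0, (x^lambda - 1) / lambda otherwise
boxcox_manual <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Sanity check against the package implementation on a strictly positive series
all.equal(as.numeric(boxcox_manual(usnetelec, 0.5)),
          as.numeric(BoxCox(usnetelec, 0.5)))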

3.2 Why is a Box-Cox transformation unhelpful for the cangas data?

# Load dataset
data("cangas")

# Box-Cox transformation
lambda <- BoxCox.lambda(cangas)
transformed_data <- BoxCox(cangas, lambda)

# Create data frames for plotting (as.numeric strips the ts class)
df_original <- data.frame(Time = as.numeric(time(cangas)), Value = as.numeric(cangas))
df_transformed <- data.frame(Time = as.numeric(time(cangas)), Value = as.numeric(transformed_data))

# Plot original data
p1 <- ggplot(df_original, aes(x = Time, y = Value)) +
  geom_line(linewidth = 1) +
  ggtitle("Original Data: cangas") +
  theme_minimal() +
  xlab("Time") +
  ylab("Value")

# Plot Box-Cox transformed data
p2 <- ggplot(df_transformed, aes(x = Time, y = Value)) +
  geom_line(linewidth = 1) +
  ggtitle(paste("Box-Cox Transformed: cangas \nLambda =", round(lambda, 2))) +
  theme_minimal() +
  xlab("Time") +
  ylab("Transformed Value")

# Display plots side by side
grid.arrange(p1, p2, ncol = 2)

Observation

The original cangas data shows a clear upward trend, but its variability does not grow steadily with the level of the series: the fluctuations are largest in the middle of the sample and smaller at both ends. After applying the Box-Cox transformation (λ = 0.58) the variance is only slightly stabilized, and the changing seasonal spread remains. The Box-Cox transformation works best when variance increases proportionally with the mean; because the variance in cangas changes non-monotonically with time rather than with the level of the series, no single λ can even it out. A transformation alone also does not make the series stationary. A better approach might be differencing the data to remove the trend (and the seasonality) before applying any transformation, as sketched below.
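
A minimal sketch of that differencing idea, using a lag-12 seasonal difference (cangas is monthly) followed by a first difference; this is illustrative rather than a full modelling workflow:

# Seasonal differencing removes the repeating annual pattern;
# a further first difference removes the remaining trend
cangas_diff <- diff(diff(cangas, lag = 12))
autoplot(cangas_diff) +
  ggtitle("cangas: seasonally and first differenced") +
  theme_minimal()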

3.3 What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

# Define file path (forward slashes work on Windows and avoid backslash-escaping issues)
file_path <- "C:/Users/Admin/Downloads/retail.xlsx"

# Read the dataset, skipping the first row
retaildata <- read_excel(file_path, skip = 1)

# Define the column name (update this if the column name is different)
col_name <- "A3349337W"

# Check if column exists
if (!col_name %in% colnames(retaildata)) {
  stop("Error: Column name not found in the dataset. Check the correct column name.")
}

# Convert the data to a time series object
myts <- ts(retaildata[[col_name]], frequency = 12, start = c(1982, 4))

# Plot the time series
ggplot(data = data.frame(Time = as.numeric(time(myts)), Value = as.numeric(myts)),
       aes(x = Time, y = Value)) +
  geom_line(linewidth = 1) +
  ggtitle("Retail Time Series Data") +
  xlab("Year") +
  ylab("Value") +
  theme_minimal()

ggseasonplot(myts)

ggsubseriesplot(myts)

gglagplot(myts)

ggAcf(myts)

(lambda <- BoxCox.lambda(myts))
## [1] 0.9165544
autoplot(BoxCox(myts, lambda))

### Observations

- The Box-Cox lambda value is 0.92, which is close to 1, meaning the estimated transformation is nearly the identity and the data need little adjustment.
- The original time series shows a strong upward trend and seasonal variation, with variance increasing over time.
- The Box-Cox transformation slightly stabilizes the variance but does not significantly change the overall trend or pattern.
- Since λ is close to 1, a stronger transformation such as log or square root is not what the data call for, although a log transformation (log(x)) would compress the growing seasonal swings somewhat.
- Transformation alone does not address the trend; differencing would be more effective for that. The sketch below compares the λ = 0.92 transform against a log transform.
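
A hedged side-by-side of the two candidate transformations mentioned above (purely illustrative; lambda comes from the BoxCox.lambda() call earlier):

# Compare the estimated Box-Cox transform with a log transform (lambda = 0)
p_bc <- autoplot(BoxCox(myts, lambda)) + ggtitle("Box-Cox (lambda = 0.92)") + theme_minimal()
p_log <- autoplot(log(myts)) + ggtitle("Log transform (lambda = 0)") + theme_minimal()
grid.arrange(p_bc, p_log, ncol = 2)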

3.4 For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect. dole, usdeaths, bricksq.

# Load datasets
data("dole")
data("usdeaths")
data("bricksq")

# Create function for time series plots
plot_series <- function(series, name) {
  df <- data.frame(Time = as.numeric(time(series)), Value = as.numeric(series))

  ggplot(df, aes(x = Time, y = Value)) +
    geom_line(linewidth = 1) +
    ggtitle(paste("Time Series Plot of", name)) +
    theme_minimal() +
    xlab("Time") +
    ylab("Value")
}

# Plot original data
p1 <- plot_series(dole, "Dole")
p2 <- plot_series(usdeaths, "US Deaths")
p3 <- plot_series(bricksq, "Bricksq")

# Arrange plots in a grid
grid.arrange(p1, p2, p3, ncol = 1)


# Compute optimal lambda for each dataset
(lambda_dole <- BoxCox.lambda(dole))
## [1] 0.3290922
(lambda_usdeaths <- BoxCox.lambda(usdeaths))
## [1] -0.03363775
(lambda_bricksq <- BoxCox.lambda(bricksq))
## [1] 0.2548929
# Apply Box-Cox transformation
dole_transformed <- BoxCox(dole, lambda_dole)
usdeaths_transformed <- BoxCox(usdeaths, lambda_usdeaths)
bricksq_transformed <- BoxCox(bricksq, lambda_bricksq)

# Function to plot transformed data (renamed so it does not overwrite plot_series above)
plot_transformed <- function(series, name) {
  df <- data.frame(Time = as.numeric(time(series)), Value = as.numeric(series))

  ggplot(df, aes(x = Time, y = Value)) +
    geom_line(linewidth = 1) +
    ggtitle(paste("Box-Cox Transformed:", name)) +
    theme_minimal() +
    xlab("Time") +
    ylab("Transformed Value")
}

# Plot transformed data
p4 <- plot_transformed(dole_transformed, "Dole (Box-Cox)")
p5 <- plot_transformed(usdeaths_transformed, "US Deaths (Box-Cox)")
p6 <- plot_transformed(bricksq_transformed, "Bricksq (Box-Cox)")

# Arrange transformed plots in a grid
grid.arrange(p4, p5, p6, ncol = 1)

### Observations

Dole (λ = 0.33)

  • The original series has an upward trend with increasing variance, especially in later years.
  • After applying the Box-Cox transformation, the variance is more stable, but the overall trend remains.
  • The transformation helps reduce extreme fluctuations, making the series more suitable for modeling.

US Deaths (λ = -0.03)

  • The original data fluctuates around a stable mean with a strong seasonal cycle, and the variance is already fairly uniform.
  • Since λ is close to 0, the Box-Cox transformation is essentially a log transformation; because the variance was stable to begin with, it changes the series very little.
  • The seasonal cycle remains after the transformation, with the variance only slightly more uniform.

Bricksq (λ = 0.25)

  • The original data shows an increasing trend with small fluctuations.
  • The Box-Cox transformation stabilizes the variance and smooths out fluctuations, making it easier to analyze.
  • Some fluctuations remain, but they are less pronounced than in the original series.

Overall: The Box-Cox transformation helps Dole and Bricksq by stabilizing variance. However, for US Deaths, it has little impact since the data was already relatively stable.
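
A quick hedged check of the λ ≈ 0 point above: with lambda near zero, BoxCox() is effectively a log transform (lambda_usdeaths comes from the earlier BoxCox.lambda() call):

# Visual comparison: Box-Cox with lambda = -0.03 vs. a plain log transform
p_bc <- autoplot(BoxCox(usdeaths, lambda_usdeaths)) + ggtitle("Box-Cox (lambda = -0.03)") + theme_minimal()
p_log <- autoplot(log(usdeaths)) + ggtitle("log(usdeaths)") + theme_minimal()
grid.arrange(p_bc, p_log, ncol = 2)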

3.5 We calculate the residuals from a seasonal naïve forecast applied to the quarterly Australian beer production data from 1992, and test whether the residuals are white noise and normally distributed.

Load and Forecast the Data

# Load Australian beer production dataset
data("ausbeer")

# Select data from 1992 onwards
beer <- window(ausbeer, start=1992)

# Apply a seasonal naïve forecast
fc <- snaive(beer)

# Plot the forecast
autoplot(fc) +
  ggtitle("Seasonal Naïve Forecast of Australian Beer Production") +
  xlab("Year") +
  ylab("Beer Production") +
  theme_minimal()
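
As a quick sanity check on what snaive() does: each forecast simply repeats the observation from the same quarter one year earlier. A minimal sketch (assuming quarterly data, frequency 4):

# The first four forecasts repeat the last four observed quarters, in order
tail(beer, 4)     # last four observed quarters
head(fc$mean, 4)  # the next four snaive forecasts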

Residual Analysis

# Extract and plot residuals
res <- residuals(fc)
autoplot(res) +
  ggtitle("Residuals of the Seasonal Naïve Forecast") +
  xlab("Year") +
  ylab("Residual Value") +
  theme_minimal()

Check if Residuals are White Noise

# Perform diagnostic checks
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 32.269, df = 8, p-value = 8.336e-05
## 
## Model df: 0.   Total lags used: 8
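
checkresiduals() bundles these diagnostics; the same checks can also be run directly with base R. A hedged sketch (the Shapiro-Wilk normality test is an addition here, not part of checkresiduals()):

res_clean <- na.omit(res)

# Ljung-Box by hand: H0 = residuals are independently distributed (white noise)
Box.test(res_clean, lag = 8, type = "Ljung-Box")

# Shapiro-Wilk: H0 = residuals are normally distributed
shapiro.test(as.numeric(res_clean))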

Observations

  1. Residuals Analysis:
    • The residual plot shows fluctuations that do not appear completely random, suggesting some structure remains in the data.
    • The residuals still exhibit some seasonal pattern, meaning the seasonal naïve method might not fully capture all underlying trends.
  2. ACF (Autocorrelation Function) Plot:
    • The autocorrelation plot (ACF) shows significant spikes at multiple lags, indicating that the residuals are not purely white noise.
    • Some values remain outside the confidence bands, suggesting there is still correlation present.
  3. Histogram and Normality Check:
    • The histogram of residuals shows some skewness, meaning the residuals are not perfectly normally distributed.
    • The overlaid density curve deviates from the expected normal bell shape.
  4. Ljung-Box Test Results:
    • The Ljung-Box test produces a p-value of 8.336e-05, which is very small.
    • Since the p-value is less than 0.05, we reject the null hypothesis that the residuals are white noise.
    • This confirms that there is still correlation in the residuals, and the seasonal naïve method may not be the best forecasting approach.

Conclusion:

  • The residuals are not white noise, meaning the seasonal naïve method does not fully capture the patterns in the beer production data.
  • There is autocorrelation in the residuals, suggesting that a more advanced forecasting model (such as ARIMA with seasonal adjustments) might perform better.
  • The residuals also deviate from normality, further supporting the need for a different forecasting method.
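
As a hedged illustration of that conclusion, an automatically selected seasonal ARIMA could be compared on the same series (the chosen order and diagnostics will depend on the data; this is a sketch, not a tuned model):

# Fit a seasonal ARIMA via auto.arima() and re-run the residual diagnostics
fit_arima <- auto.arima(beer)
checkresiduals(fit_arima)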

3.7 Are the following statements true or false? Explain your answer.

a. Good forecast methods should have normally distributed residuals.

False.
  • While normality is desirable, the more important criteria are that residuals be uncorrelated and have constant variance (homoscedasticity).
  • If residuals are correlated, patterned information is left in the data that the model did not capture.
  • Many forecasting models still perform well even if the residuals are not perfectly normal.

b. A model with small residuals will give good forecasts.

False.
  • Small residuals on the training data do not guarantee good forecasts.
  • A model may have small residuals because it overfits the training data but then performs poorly on new data.
  • The ability to generalize is more important than minimizing residuals on past observations.

c. The best measure of forecast accuracy is MAPE.

False.
  • MAPE (Mean Absolute Percentage Error) is widely used but has limitations, especially when dealing with zero or near-zero values.
  • Other metrics such as RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and sMAPE (symmetric MAPE) may be more appropriate depending on the context.

d. If your model doesn’t forecast well, you should make it more complicated.

False.
  • A more complex model is not always better and may lead to overfitting.
  • Instead, one should:
    • Check simpler alternative models.
    • Tune hyperparameters.
    • Use regularization techniques.
    • Consider more relevant features rather than increasing complexity.

e. Always choose the model with the best forecast accuracy as measured on the test set.

False.
  • Forecast accuracy is important, but other factors also matter:
    • Interpretability: a simpler model is often preferred in real-world applications.
    • Computational cost: complex models can be difficult to implement in production.
    • Robustness: a model might work well on one test set but fail in a different scenario.
  • A balance between accuracy, interpretability, and generalizability is essential.

3.8 For your retail time series (from Exercise 3 in Section 2.10):

a. Split the Data into Training and Test Sets

# Load the dataset
file_path <- "C:/Users/Admin/Downloads/retail.xlsx"
retaildata <- read_excel(file_path, skip = 1)

# Define column name (update if necessary)
col_name <- "A3349337W"

# Validate column exists
if (!col_name %in% colnames(retaildata)) {
  stop("Error: Column not found in dataset. Check column name.")
}

# Convert to time series
myts <- ts(retaildata[[col_name]], frequency = 12, start = c(1982, 4))

# Split into training (up to Dec 2010) and test (from Jan 2011 onwards)
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

b. Visualizing the Train-Test Split

# Plot train and test sets
autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test") +
  ggtitle("Retail Time Series: Train vs. Test Split") +
  xlab("Year") +
  ylab("Sales") +
  theme_minimal() +
  guides(colour=guide_legend(title="Data Split"))

c. Apply a Seasonal Naïve Forecast

# Fit a seasonal naïve model
fc <- snaive(myts.train)

# Plot the forecast
autoplot(fc) +
  ggtitle("Seasonal Naïve Forecast for Retail Sales") +
  xlab("Year") +
  ylab("Sales") +
  theme_minimal()

d. Compare Forecast Accuracy

# Compute accuracy against the test set
accuracy(fc, myts.test)
##                     ME     RMSE      MAE      MPE      MAPE      MASE      ACF1
## Training set  9.460661 26.30758 21.23363 4.655690 12.762886 1.0000000 0.8070166
## Test set     17.212500 21.26067 17.39583 4.748234  4.807728 0.8192584 0.4843871
##              Theil's U
## Training set        NA
## Test set     0.6934111

e. Residual Analysis

# Check residuals for normality and autocorrelation
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 856.11, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

f. Sensitivity of Accuracy to Training/Test Split

  • The accuracy of forecasts can vary depending on the split point.
  • A different test period may yield higher or lower error metrics due to:
    • Changes in seasonality or trends.
    • External economic factors influencing sales.
    • Data distribution variations.
  • Re-evaluating with different train-test splits can help determine the model’s robustness; a minimal sketch of that idea follows.
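
A hedged sketch of that re-evaluation, refitting the seasonal naïve model over a few alternative split years (the split years chosen here are arbitrary illustrations):

# Compare test-set RMSE of snaive() across several candidate split points
for (split_year in c(2008, 2009, 2010, 2011)) {
  train <- window(myts, end = c(split_year, 12))
  test <- window(myts, start = c(split_year + 1, 1))
  fc_split <- snaive(train, h = length(test))
  cat("Split after", split_year, "- test RMSE:",
      round(accuracy(fc_split, test)["Test set", "RMSE"], 2), "\n")
}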

Observations

Train-Test Split (Part a & b)

  • The time series is split into a training set (up to 2010) and a test set (2011 onward).
  • The plot confirms that the split is correct, with the training data covering a longer period and the test data containing the more recent observations.

Forecasting with Seasonal Naïve Method (Part c & d)

  • The seasonal naïve forecast projects that future values will repeat the seasonal pattern from previous years.
  • Test set accuracy shows RMSE = 21.26 and MAE = 17.40, lower errors than on the training set, suggesting the seasonal pattern is fairly predictable over this test period.
  • Theil’s U = 0.69 (< 1) indicates that the forecast beats a naïve no-change benchmark but still leaves room for improvement.

Residual Analysis (Part e)

  • The residuals do not appear to be white noise as seen in the residual plot.
  • ACF plot shows high autocorrelation, meaning the residuals are not random, which suggests the model might be missing some important structure.
  • The Ljung-Box test returns a very low p-value (< 2.2e-16), leading us to reject the white noise hypothesis. This confirms that the residuals are correlated and that the seasonal naïve model may not fully capture the underlying patterns.

Sensitivity of Accuracy to Train/Test Split (Part f)

  • The test set performs better than the training set based on lower RMSE and MAE, likely because the seasonal naïve model works well for short-term forecasts.
  • However, if we changed the split, forecasts for different periods could yield different errors, meaning the model’s accuracy is sensitive to the choice of train-test split.

Conclusion

  • The seasonal naïve model captures seasonality well but does not fully account for long-term trends.
  • The residuals are autocorrelated, meaning a more advanced model (like ETS or ARIMA) could improve forecasting.
  • The model’s accuracy depends on the chosen train-test split, suggesting that validating across multiple splits (e.g., time series cross-validation) could give a more robust assessment; a minimal tsCV() sketch follows.
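
A minimal time series cross-validation sketch with forecast::tsCV(), computing rolling one-step seasonal naïve errors (the horizon and the RMSE summary are illustrative choices):

# tsCV() returns forecast errors from a rolling forecast origin
e <- tsCV(myts, snaive, h = 1)
sqrt(mean(e^2, na.rm = TRUE))  # cross-validated one-step RMSE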

3.9 visnights contains quarterly visitor nights (in millions) from 1998 to 2016 for twenty regions of Australia.

a. Create Training Sets Omitting the Last 1, 2, and 3 Years

# Load dataset
data("visnights")

# Define training sets
train1 <- window(visnights[, "QLDMetro"], end = c(2015, 4)) # Omitting last 1 year
train2 <- window(visnights[, "QLDMetro"], end = c(2014, 4)) # Omitting last 2 years
train3 <- window(visnights[, "QLDMetro"], end = c(2013, 4)) # Omitting last 3 years

# Define corresponding one-year test sets (each matches the h = 4 forecasts below)
test1 <- window(visnights[, "QLDMetro"], start = c(2016, 1))                   # 2016
test2 <- window(visnights[, "QLDMetro"], start = c(2015, 1), end = c(2015, 4)) # 2015
test3 <- window(visnights[, "QLDMetro"], start = c(2014, 1), end = c(2014, 4)) # 2014

b. Compute One-Year Forecasts Using Seasonal Naïve Method

# Fit seasonal naïve models
fc1 <- snaive(train1, h=4)
fc2 <- snaive(train2, h=4)
fc3 <- snaive(train3, h=4)

# Plot forecasts
p1 <- autoplot(fc1) + ggtitle("Forecast from Train1") + theme_minimal()
p2 <- autoplot(fc2) + ggtitle("Forecast from Train2") + theme_minimal()
p3 <- autoplot(fc3) + ggtitle("Forecast from Train3") + theme_minimal()

# Arrange plots in a grid
grid.arrange(p1, p2, p3, ncol = 1)

c. Compare Forecast Accuracy Using MAPE

# Compute accuracy metrics for each forecast
acc1 <- accuracy(fc1, test1)
acc2 <- accuracy(fc2, test2)
acc3 <- accuracy(fc3, test3)

# Create a data frame to compare accuracy metrics
accuracy_df <- data.frame(
  Training_Set = c("Train1 (Omitting 1 year)", "Train2 (Omitting 2 years)", "Train3 (Omitting 3 years)"),
  MAPE = c(acc1["Test set", "MAPE"], acc2["Test set", "MAPE"], acc3["Test set", "MAPE"]),
  RMSE = c(acc1["Test set", "RMSE"], acc2["Test set", "RMSE"], acc3["Test set", "RMSE"]),
  MAE = c(acc1["Test set", "MAE"], acc2["Test set", "MAE"], acc3["Test set", "MAE"])
)

# Display the accuracy table
knitr::kable(accuracy_df, caption = "Comparison of Forecast Accuracy Metrics")
Comparison of Forecast Accuracy Metrics

| Training_Set              |     MAPE |      RMSE |       MAE |
|---------------------------|---------:|----------:|----------:|
| Train1 (Omitting 1 year)  | 6.159821 | 0.9358727 | 0.7094002 |
| Train2 (Omitting 2 years) | 3.057463 | 0.4117902 | 0.3133488 |
| Train3 (Omitting 3 years) | 8.476977 | 1.0586583 | 0.8625501 |
# Visualization of MAPE values
ggplot(accuracy_df, aes(x = Training_Set, y = MAPE, fill = Training_Set)) +
  geom_col(color = "black") +  # geom_col() is the idiomatic form of geom_bar(stat = "identity")
  ggtitle("MAPE Comparison Across Training Sets") +
  xlab("Training Set") +
  ylab("MAPE (%)") +
  theme_minimal() +
  theme(legend.position = "none")
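
For reference, MAPE is just the mean absolute percentage error; a minimal hand computation for the first test set, which should match the table's first row up to rounding:

# MAPE = mean(|actual - forecast| / |actual|) * 100
mape_manual <- mean(abs((test1 - fc1$mean) / test1)) * 100
mape_manual  # compare with acc1["Test set", "MAPE"]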

Observations

  1. MAPE Variability Across Training Sets:
    • The model trained with Train2 (omitting 2 years) had the lowest MAPE (3.06%), suggesting that this training set provided the most accurate forecasts.
    • Train1 (omitting 1 year) had a higher MAPE (6.16%), indicating slightly less accurate predictions.
    • Train3 (omitting 3 years) resulted in the highest MAPE (8.48%), showing that removing too much historical data led to a decline in forecast accuracy.
  2. RMSE and MAE Comparison:
    • Train2 also had the lowest RMSE (0.41) and MAE (0.31), reinforcing that it provided the best predictions.
    • Train1 had slightly higher RMSE (0.94) and MAE (0.71), but still performed better than Train3.
    • Train3 had the highest RMSE (1.06) and MAE (0.86), which aligns with its higher MAPE, indicating that the model struggled more with accuracy when trained on even less data.
  3. Impact of the Training/Test Split:
    • Each training set is paired with a different test year, so these differences reflect both how much history the model saw and how predictable the forecast year happened to be.
    • Train2 (omitting 2 years) forecast 2015, which evidently repeated the prior year’s seasonal pattern closely; Train3 had both the least history and, apparently, the hardest year to predict (2014).
    • Train1 performed moderately well, but not as well as Train2.
  4. Takeaway:
    • Forecast accuracy measures are highly sensitive to the choice of training/test split: the same seasonal naïve method produced MAPEs ranging from about 3% to 8.5% depending on which year it was asked to forecast.
    • A single split can therefore be misleading; evaluating over several splits, or with rolling-origin cross-validation (tsCV()), gives a more reliable picture of the method’s accuracy.