Q1:

  1. Good forecast methods should have normally distributed residuals. False. What a good forecasting method needs is residuals that are uncorrelated and have zero mean; normality merely makes prediction intervals easier to compute. Normally distributed residuals are convenient for methods such as linear regression, but many methods (including machine learning algorithms) make no normality assumption, and a method can forecast well without it.

  2. A model with small residuals will give good forecasts. Not necessarily. Small residuals only show that the model fits the training data closely. A model with very small residuals may be overfitting and can still forecast new, unseen data poorly, so forecast performance has to be judged on a separate test set rather than on how well the model fits the data it was estimated from.

  3. The best measure of forecast accuracy is MAPE. False. The choice of measure depends on the data and the goals of the forecasting task. MAPE is easy to interpret, but it can be distorted by observations close to zero (where percentage errors explode), it only makes sense for data with a meaningful zero, and it penalises negative errors more heavily than positive ones. Other metrics such as MAE, RMSE, or MASE are commonly used, and it is best to look at several measures together (see the short code sketch after this list).

  4. If your model doesn’t forecast well, you should make it more complicated. Not necessarily. Increasing model complexity may improve performance up to a point, but it can also lead to overfitting, where the model performs well on the training data but poorly on new data. It’s essential to balance model complexity with generalization to ensure the model captures underlying patterns without fitting too closely to noise in the training data.

  5. Always choose the model with the best forecast accuracy as measured on the test set. False. While test set performance is crucial, it’s important to consider the complexity of the model, interpretability, and computational efficiency. The best model depends on the specific requirements and constraints of the forecasting problem, and a balance must be struck between accuracy and other practical considerations.
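To make points 2, 3 and 5 concrete, here is a minimal sketch using fpp3 (the same package loaded below for Q2): fit a benchmark model on a training window and judge it on a held-out test set with several accuracy measures rather than MAPE alone.

library(fpp3)

# hold out the last two years of quarterly beer production as a test set
train <- aus_production |> filter(year(Quarter) <= 2007)
fit   <- train |> model(snaive = SNAIVE(Beer))

# accuracy() against the full data reports RMSE, MAE, MAPE, MASE, ... on the test period
fit |> forecast(h = "2 years") |> accuracy(aus_production)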

Q2:

# Q2) Consider the number of pigs slaughtered in New South Wales (data set aus_livestock).
# (60 points)
library(fpp3)
# aus_livestock is already a monthly tsibble keyed by Animal and State,
# so no re-indexing is needed; just take a quick look at it.
aus_livestock
# 1)
# Produce time and seasonal plots of the data, in order to become familiar with it.

pigs <- aus_livestock |>
  filter(Animal == 'Pigs' & State == 'Victoria')

pigsSlaughteredInVictoriaPlot <- pigs |>
  autoplot(Count) +
  labs(title = 'Pigs Slaughtered in Victoria Time Series')
pigsSlaughteredInVictoriaPlot
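# The question also asks for a seasonal plot; a gg_season() view of the same
# series (one line per year) makes the within-year pattern easier to see.
pigs |>
  gg_season(Count) +
  labs(title = 'Pigs Slaughtered in Victoria: Seasonal Plot')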

# 2)
# Create a training set of 486 observations, withholding a test set of 72 observations (6 years).
pigs_trained <- pigs |>
  filter(year(Month) < 2013)
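
# Quick check of the split: the training set should contain 486 observations
# and the withheld 2013-2018 period the remaining 72.
nrow(pigs_trained)
nrow(pigs) - nrow(pigs_trained)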

# 3)
# Try using various benchmark methods to forecast the training set and compare the results on the test set. Which method did best?
pigs_trained_fitted <- pigs_trained |>
  model(
    Mean = MEAN(Count),
    Naive = NAIVE(Count),
    Seasonal_Naive = SNAIVE(Count),
    Drift = RW(Count ~ drift())
  ) 

# Forecast over the full 6-year (72-month) test period and plot against the
# complete series so the forecasts can be compared with the held-out data.
q23 <- pigs_trained_fitted |>
  forecast(h = 72) |>
  autoplot(pigs, level = NULL) +
  labs(title = "Forecasts for monthly pig count") +
  guides(color = guide_legend(title = "Forecast"))

q23
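
# To see which method did best, compare the forecasts with the withheld data
# using accuracy(); the benchmark with the lowest RMSE/MAE on the test set wins.
pigs_trained_fitted |>
  forecast(h = 72) |>
  accuracy(pigs) |>
  arrange(RMSE)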

# 4)
# Check the residuals of your preferred method. Do they resemble white noise?

q24 <- pigs_trained |>
  model(SNAIVE(Count))

q24 |>
  gg_tsresiduals() + ggtitle("Residual")
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

They do not seem to resemble white noise: the ACF plot shows significant autocorrelation at several lags, so the residuals still contain information that the seasonal naive method is not using.
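
A more formal check could use a Ljung-Box test on the innovation residuals; with monthly data a lag of 24 is a common choice, and a small p-value would confirm that the residuals are not white noise:

q24 |>
  augment() |>
  features(.innov, ljung_box, lag = 24)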

Q3:

# Q3) tourism contains quarterly visitor nights (in thousands) from 1998 to 2017 for 76 regions of Australia. (80 points)
library(fpp3)
library(dplyr)

tourism
# 1)
# Extract data from the Gold Coast region using filter() and aggregate total overnight trips (sum over Purpose) using summarise(). Call this new dataset gc_tourism.

# tourism is already a quarterly tsibble keyed by Region, State and Purpose,
# so we can filter and let summarise() aggregate over the remaining keys.
gc_tourism <- tourism |>
  filter(Region == "Gold Coast") |>
  summarise(total_trips = sum(Trips))

# 2)
# Using slice() or filter(), create three training sets for this data excluding the last 1, 2 and 3 years. For example, gc_train_1 <- gc_tourism |> slice(1:(n()-4)).

gc_train_less1 <- gc_tourism |> slice(1:(n() - 4))    # drop the last 1 year  (4 quarters)
gc_train_less2 <- gc_tourism |> slice(1:(n() - 8))    # drop the last 2 years (8 quarters)
gc_train_less3 <- gc_tourism |> slice(1:(n() - 12))   # drop the last 3 years (12 quarters)

# 3)
# Compute one year of forecasts for each training set using the seasonal naïve (SNAIVE()) method. Call these gc_fc_1, gc_fc_2 and gc_fc_3, respectively.
gc_fc_1 <- gc_train_less1 |>
  model(Snaive = SNAIVE(total_trips)) |>
  forecast(h = 4)

gc_fc_2 <- gc_train_less2 |>
  model(Snaive = SNAIVE(total_trips)) |>
  forecast(h = 4)

gc_fc_3 <- gc_train_less3 |>
  model(Snaive = SNAIVE(total_trips)) |>
  forecast(h = 4)
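
# Optional visual check: overlay one of the forecast sets on the full series to
# see how the seasonal naive forecasts line up with the withheld quarters.
gc_fc_3 |>
  autoplot(gc_tourism, level = NULL) +
  labs(title = "SNAIVE forecasts, training set excluding the last 3 years")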


# 4)
# Use accuracy() to compare the test set forecast accuracy using MAPE. Comment on these.

gc_fc_1_accuracy <- gc_fc_1 |> accuracy(gc_tourism)
gc_fc_2_accuracy <- gc_fc_2 |> accuracy(gc_tourism)
gc_fc_3_accuracy <- gc_fc_3 |> accuracy(gc_tourism)
q34 <- bind_rows(
  select(gc_fc_1_accuracy, MAPE),
  select(gc_fc_2_accuracy, MAPE),
  select(gc_fc_3_accuracy, MAPE)) |>
  mutate(TestSet = c("last 1 year withheld", "last 2 years withheld", "last 3 years withheld"))

q34

A lower MAPE generally indicates better forecast accuracy. In this example, the forecasts from the training set that excludes the last 3 years have the lowest MAPE, suggesting they are relatively more accurate than the forecasts from the sets excluding the last 1 and 2 years.

Q4:

# Q4) Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures. (90 points)
jan14_vic_elec <- vic_elec |>
  filter(yearmonth(Time) == yearmonth("2014 Jan")) |>
  index_by(Date = as_date(Time)) |>
  summarise(Demand = sum(Demand), Temperature = max(Temperature))
# 1)
# Plot the data and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?

jan14_vic_elec |>
  gather("Variable", "Value", Demand, Temperature) |>
  ggplot(aes(x = Date, y = Value)) +
  geom_line() +
  facet_grid(vars(Variable), scales = "free_y")

jan14_vic_elec |>
  ggplot(aes(x=Temperature, y=Demand)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

# 2)
# Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
fit <- jan14_vic_elec |>
  model(TSLM(Demand ~ Temperature))

fit |> gg_tsresiduals()
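
# Part 1 also asks for the regression model itself; report() prints the fitted
# TSLM coefficients (intercept and Temperature slope).
fit |> report()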

# 3)
# Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts? The following R code will get you started: 

# wanted to use my own code here so both scenarios show in one graph

next_day <- scenarios(
  `Cold day` = new_data(jan14_vic_elec, 1) |> mutate(Temperature = 15),
  `Hot day` =  new_data(jan14_vic_elec, 1) |> mutate(Temperature = 35))

fc <- fit |>
  forecast(new_data = next_day)

autoplot(jan14_vic_elec, Demand) +
  autolayer(fc)

# 4)
# Give prediction intervals for your forecasts.

# Prediction intervals for the two scenario forecasts: hilo() extracts the
# 80% and 95% intervals from the fable forecast distributions.
fc |> hilo(level = c(80, 95))

# 5)
# Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?

# Aggregate the half-hourly data to daily totals and daily maximum temperature;
# index_by() changes the tsibble index from Time to Date before summarising.
daily_aggregated_data <- vic_elec |>
  index_by(Date = as_date(Time)) |>
  summarise(Demand = sum(Demand), Temperature = max(Temperature))

# Plot Demand vs Temperature
ggplot(daily_aggregated_data, aes(x = Temperature, y = Demand)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Demand vs Temperature (Aggregated)",
       x = "Temperature (°C)",
       y = "Demand (kWh)")
## `geom_smooth()` using formula = 'y ~ x'

1) The positive relationship arises because, as temperature increases, more electricity is needed for cooling (air conditioning).

2) The model is mostly adequate. Around 27 January there are two outlying residuals, which can also be seen in the histogram, where the left side is heavier and the residuals do not appear normally distributed.

3) The forecast for a maximum temperature of 35°C seems appropriate, but the forecast for 15°C looks too low, even compared with the historical lows in the data.

4) The 80% and 95% prediction intervals are given by the hilo() output above.

5) Plotting all of the available data shows that demand follows a U shape in temperature rather than the straight line suggested by the fitted linear trend, so a model estimated on January data alone does not generalise to the full range of temperatures.

Q5:

library(forecast)
# Q5) The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff. (80 points)
# 1)
# Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

souvenirs %>% autoplot(Sales)
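
# A seasonal plot (one line per year) is a useful complement for describing the
# pattern: it makes the December peak and the March festival bump easier to see.
souvenirs |> gg_season(Sales)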

# 2)
# Explain why it is necessary to take logarithms of these data before fitting a model.
# Sales is already numeric; compare the raw and log-transformed distributions.
hist(souvenirs$Sales)
hist(log(souvenirs$Sales))

# 3)
# Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

# Monthly ts of the sales data (note: truncated at June 1993 here) and its logarithm
souv <- ts(souvenirs$Sales, start = c(1987, 1), end = c(1993, 6), frequency = 12)
souv_log <- log(souv)

# Surfing-festival dummy: 1 in March of every year except 1987, since the
# festival only began in 1988.
d_festival <- rep(0, length(souv))
d_festival[seq_along(d_festival) %% 12 == 3] <- 1
d_festival[3] <- 0
d_festival <- ts(d_festival, freq = 12, start = c(1987, 1))

new_data <- data.frame(souv_log, d_festival)
fit <- tslm(souv_log ~ trend + season + d_festival, data = new_data)
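
# Inspect the estimated coefficients (trend, monthly seasonal dummies and the
# festival dummy) of the fitted regression.
summary(fit)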

# Forecast the next 12 months (Jul 1993 - Jun 1994); the festival dummy must be
# set in March 1994, which is the 9th month of this forecast horizon.
forecast1 <- data.frame(d_festival = rep(0, 12))
forecast1[9, ] <- 1
forecast(fit, newdata = forecast1)
# 4)Plot the residuals against time and against the fitted values.
#  Do these plots reveal any problems with the model?


res <- residuals(fit)

fit_vector <- as.vector(fitted(fit))
fit_residuals <- as.vector(residuals(fit))

plot(res, xlab = "Year", ylab = "Residuals")

plot(fit_vector, fit_residuals, xlab = "Predicted scores", ylab = "Residuals")

# 5)
# Do boxplots of the residuals for each month. Does this reveal any problems with the model?
boxplot(resid(fit) ~ cycle(resid(fit)))

# 6)
# What does the Ljung-Box test tell you about your model?

# I can't get gg_tsresiduals() to work on this tslm fit, so instead I check the ACF for any abnormalities and then look at the Ljung-Box p-values at those lags
ACF(souvenirs)|>autoplot()
## Response variable not specified, automatically selected `var = Sales`
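
# Since fit is a forecast::tslm object, checkresiduals() from the forecast
# package gives the residual time plot, ACF and histogram in one call, together
# with a Breusch-Godfrey test for serial correlation (the version of the
# Ljung-Box idea used for regression models).
checkresiduals(fit)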

Box.test(residuals(fit), lag = 1, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(fit)
## X-squared = 17.761, df = 1, p-value = 2.505e-05
Box.test(residuals(fit), lag = 12, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(fit)
## X-squared = 43.874, df = 12, p-value = 1.605e-05

1) Features of the data: it is strongly seasonal, with the same pattern repeating every year and a large peak each December. There is also a spike every March except March 1987, because the surfing festival only started the following year. Sales grow multiplicatively over the years, so the seasonal swings get larger as the level increases.

2) In some cases the variance of a series increases with its level, as it does here. Taking logarithms stabilises the variance, making the data more homoscedastic, and it linearises multiplicative relationships: when one variable changes by a roughly constant percentage of another, the log scale turns that into a straight-line relationship that is easier to interpret and to model with linear regression.

Log transformations therefore tend to produce more homogeneous residuals, which is a key assumption of many statistical models. As the second histogram shows, taking the log removes most of the skew in the sales data and the distribution looks much closer to normal.

4) The residuals for 1991 seem much lower than in the other years. Apart from that, the random scatter of residuals against the fitted values suggests the errors are homoscedastic.

5) The boxplots show less variance in the residuals in the middle of the year than at the beginning and end, implying the model may not be capturing information relevant to those periods.

6) There are significant autocorrelations at lags 1 and 12, and the p-values of both Ljung-Box tests are far below 5%. We therefore reject the null hypothesis of independence and conclude that the residuals are serially correlated, so the model is not capturing all of the available information and requires further investigation.

Good luck!