Case 3 possible predictor variables (a rough illustrative model using a few of these is sketched after the list):
Make and model
Year of manufacture
Engine size
Fuel type
Transmission type
Vehicle type (sedan, SUV, truck)
Mileage at resale
Condition at resale (excellent, good, fair, poor)
Number of previous owners
Original purchase price
Lease duration
Time of year
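To illustrate how these predictors could enter a model, here is a minimal sketch that regresses resale price on a few of them. The data frame car_resale, its columns, and the coefficients used to simulate it are all made up purely so the code runs end to end.
#hypothetical example: simulate a small car-resale dataset and regress
#resale price on a few of the candidate predictors listed above
set.seed(1)
car_resale <- data.frame(
  age_years = sample(1:8, 200, replace = TRUE),
  mileage_km = runif(200, 10000, 150000),
  purchase_price = runif(200, 20000, 60000),
  condition = factor(sample(c('excellent', 'good', 'fair', 'poor'), 200, replace = TRUE))
)
car_resale$resale_price <- 0.6 * car_resale$purchase_price -
  1500 * car_resale$age_years - 0.05 * car_resale$mileage_km + rnorm(200, sd = 2000)
resale_fit <- lm(resale_price ~ age_years + mileage_km + purchase_price + condition,
  data = car_resale)
summary(resale_fit)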
Case 4 possible predictor variables (a similar sketch follows this list):
Week of the year
Public holidays
Day of the week
Flight distance
Ticket prices (economy, business, first class)
Seat availability per class
Competitor airlines’ flight frequencies on the same routes
Fuel prices (may affect ticket prices)
Major sporting events in key cities
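Likewise, a minimal sketch of how the calendar predictors above could be used for daily passenger demand. The tsibble daily_pax and all of its columns are simulated for illustration only; the holiday dates and effect sizes are arbitrary.
#hypothetical example: simulate daily passenger counts and regress them on
#day-of-week and holiday dummies with TSLM
library(fpp3)
set.seed(1)
dates <- as.Date('2023-01-01') + 0:364
daily_pax <- tsibble(
  date = dates,
  dow = factor(weekdays(dates)),
  holiday = as.integer(format(dates, '%m-%d') %in% c('01-01', '12-25')),
  pax = 1000 + 200 * (weekdays(dates) %in% c('Saturday', 'Sunday')) + rnorm(365, sd = 50),
  index = date
)
pax_fit <- daily_pax |>
  model(TSLM(pax ~ dow + holiday))
report(pax_fit)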
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
data('aus_arrivals')
aus_arrivals |>
autoplot(Arrivals)
Based on the autoplot() we can see some seasonality: arrivals from every country follow a similar up-and-down pattern within each year.
All countries show a steady increase except Japan, whose arrivals decline after around 2005.
aus_arrivals |>
gg_season(Arrivals)
It appears as though all countries except Japan have had an increase in arrivals as the years go by.
It’s also visible that each country has a different seasonal pattern, for example:
Japan has higher 1st and 3rd quarters, and lower 2nd and 4th quarters.
NZ has the 3rd and 4th quarters as its highest for arrivals, while the 1st and 2nd are lower.
The UK shows the most prominent behavior: the 2nd and 3rd quarters always dip, while the 4th and 1st are higher.
The US shows behavior similar to Japan’s, except for the dip in more recent years.
aus_arrivals |>
gg_subseries(Arrivals)
Same as before, all countries except Japan show a steady increase throughout the years, with a few drops.
Japan followed the same behavior until about 2005, when arrivals started declining heavily.
data('aus_production')
gas <- tail(aus_production, 5 * 4) |>
select(Gas)
autoplot(gas, Gas)
Both seasonal fluctuations and a trend-cycle can be identified.
Q1 is always the lowest quarter of each year, Q2 increases, Q3 increases further from Q2, and Q4 drops back below Q2.
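As a quick check of that quarterly pattern (not required by the exercise), a seasonal plot of the same 20 observations can be drawn:
#seasonal plot to confirm the quarterly pattern described above
gas |>
  gg_season(Gas)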
gas_decomp <- gas |>
model(classical_decomposition(Gas, type = 'multiplicative'))
#extract components
components <- gas_decomp |>
components()
head(components)
## # A dable: 6 x 7 [1Q]
## # Key: .model [1]
## # : Gas = trend * seasonal * random
## .model Quarter Gas trend seasonal random season_adjust
## <chr> <qtr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "classical_decomposition(Ga… 2005 Q3 221 NA 1.13 NA 196.
## 2 "classical_decomposition(Ga… 2005 Q4 180 NA 0.925 NA 195.
## 3 "classical_decomposition(Ga… 2006 Q1 171 200. 0.875 0.974 195.
## 4 "classical_decomposition(Ga… 2006 Q2 224 204. 1.07 1.02 209.
## 5 "classical_decomposition(Ga… 2006 Q3 233 207 1.13 1.00 207.
## 6 "classical_decomposition(Ga… 2006 Q4 192 210. 0.925 0.987 208.
autoplot(components)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
The results do support the graphical interpretation from part A.
The seasonal component is consistent across the years, so there is a recurring seasonal pattern.
Every year, Q1 is the lowest, Q4 the second lowest, and Q3 the highest.
We can also see that the data have a slight upward trend, meaning that even though the seasonal component goes up and down, the series is steadily trending higher.
#seasonally adjusted data = trend * random for a multiplicative decomposition
gas_adjusted <- components |>
  mutate(seasonally_adjusted = trend * random)
autoplot(gas, Gas) +
autolayer(gas_adjusted, seasonally_adjusted, color = 'red', linetype = 'dashed')
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
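As a cross-check (not part of the original question), the components object already contains a season_adjust column, computed as Gas / seasonal, which should give essentially the same seasonally adjusted series but without the missing values at the ends caused by the NA trend estimates:
#equivalent seasonally adjusted series using the built-in season_adjust column
autoplot(gas, Gas) +
  autolayer(components, season_adjust, color = 'blue', linetype = 'dashed')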
gas_outlier <- gas
gas_outlier$Gas[10] <- gas_outlier$Gas[10] + 300 #modifying the 10th observation
#recompute decomposition
gas_decomp_outlier <- gas_outlier |>
model(classical_decomposition(Gas, type = 'multiplicative'))
components_outlier <- gas_decomp_outlier |>
components()
autoplot(components_outlier)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
The main effect of the outlier shows up in the trend: instead of a slight upward trend, we see a large jump right around where the outlier is, after which the trend resumes.
The seasonal component also seems to be affected, mainly in that Q4 now appears to be the highest quarter. This is likely because the outlier was introduced in Q4, and the seasonal component adjusted to account for the unusually high value.
Yes, if the outlier is near the end rather than in the middle, it will make a difference.
If it is near the end, the decomposition tends to over-adjust the last few observations, since the trend there is estimated from fewer surrounding points.
If it is in the middle, the effect is spread more evenly across the trend and seasonal components.
gas_outlier_end <- gas
gas_outlier_end$Gas[19] <- gas_outlier_end$Gas[19] + 300
#recompute decomposition
gas_decomp_outlier_end <- gas_outlier_end |>
model(classical_decomposition(Gas, type = 'multiplicative'))
components_outlier_end <- gas_decomp_outlier_end |>
components()
autoplot(components_outlier_end)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
As expected, the trend over-adjusts at the end, while the seasonal component remains very similar to the original plot.
set.seed(42)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
#create training dataset of observations before 2011
myseries_train <- myseries |>
filter(year(Month) < 2011)
#check that data was split properly
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
#fit seasonal naive model
fit <- myseries_train |>
model(SNAIVE(Turnover))
#check residuals
fit |>
gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
The residuals do not appear to be uncorrelated: the ACF plot shows multiple spikes outside the threshold, so several lags are significantly correlated.
They also do not look normally distributed.
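To back up the visual impression, a Ljung-Box test on the innovation residuals can be run (lag = 24 is a common choice for monthly data; the exact lag is a judgment call):
#Ljung-Box test on the innovation residuals
augment(fit) |>
  features(.innov, ljung_box, lag = 24)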
#forecasts for the test data
fc <- fit |>
forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |>
autoplot(myseries)
#compare accuracy of forecast against actual values
fit |>
accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western… Newspap… SNAIV… Trai… 0.838 5.67 4.09 2.49 16.0 1 1 0.849
fc |>
accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Newspap… Test 3.04 9.11 7.34 6.06 19.9 1.79 1.61 0.513
Results on the training data show an RMSE of 5.67 and an ACF1 of 0.849, which is quite high and indicates some correlation in the residuals.
The model fits the data reasonably well, but the high ACF1 suggests the residuals still contain patterns.
Results on the test data show the RMSE increasing to 9.11, so the error is higher for the test set.
ACF1 is lower, so the residuals in the test data are less correlated than in training.
The accuracy measures are sensitive to the amount of training data used; how sensitive depends on the series. More training data usually improves stability and trend detection, but too much may include outdated patterns, while less data focuses on recent trends at the risk of overfitting short-term fluctuations.
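One way to gauge this sensitivity (an extra check, not asked for explicitly) is to refit the same seasonal naive model with a different training cutoff and compare the test accuracy; the 2015 cutoff below is an arbitrary choice for illustration.
#refit on a longer training window (cutoff 2015 instead of 2011) and compare
myseries_train_alt <- myseries |>
  filter(year(Month) < 2015)
fit_alt <- myseries_train_alt |>
  model(SNAIVE(Turnover))
fit_alt |>
  forecast(new_data = anti_join(myseries, myseries_train_alt)) |>
  accuracy(myseries)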
library(tsibbledata)
data('global_economy')
#filter to only include afghanistan
afg_pop <- global_economy |>
filter(Country == 'Afghanistan')
Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?
afg_pop |>
autoplot(Population)
Yes, the impact of the Soviet-Afghan war is visible: the population dips noticeably during the war years and then resumes its previous trend afterwards.
# linear and piecewise models
models <- afg_pop |>
model(
linear = TSLM(Population ~ trend()),
piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
)
augment(models) |>
autoplot(.fitted) +
geom_line(aes(y = Population), color = 'black')
fc2 <- models |>
forecast(h = '5 years')
autoplot(fc2) +
autolayer(afg_pop, Population)
The linear model clearly performs poorly compared to the piecewise model. Not only are its predictions likely incorrect, but its prediction intervals are huge.
The piecewise model, on the other hand, looks much better, although its prediction intervals could be too narrow. It extrapolates the trend after 1989, since the second knot excludes the data from the wartime dip.
#select a country to analyze
mex_economy <- global_economy |>
filter(Country == 'Mexico')
#plot the exports
mex_economy |>
autoplot(Exports)
Since the data are annual, there is no seasonality to look for.
In the long term the trend is increasing, but there are a few points where the increase or decrease is much sharper than expected.
We can conclude that Mexico’s exports show a few sudden shifts.
ets_ann <- mex_economy |>
model(ETS(Exports ~ error('A') + trend('N') + season('N')))
#generate forecasts
fc_ann <- ets_ann |>
forecast(h = '5 years')
#plot forecasts
fc_ann |>
autoplot(mex_economy)
#compute RMSE for training data
accuracy(ets_ann)
## # A tibble: 1 × 11
## Country .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mexico "ETS(Exports ~ … Trai… 0.506 2.15 1.38 1.83 7.78 0.983 0.991 0.203
RMSE: 2.154425
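(The exact value quoted above can be pulled out of the accuracy table, for example:)
#extract the training RMSE as a single number
accuracy(ets_ann) |>
  pull(RMSE)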
ets_aan <- mex_economy |>
model(ETS(Exports ~ error('A') + trend('A') + season('N')))
#forecast for aan
fc_aan <- ets_aan |>
forecast(h = '5 years')
#plot
fc_aan |>
autoplot(mex_economy)
accuracy(ets_aan)
## # A tibble: 1 × 11
## Country .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mexico "ETS(Exports… Trai… -0.00816 2.09 1.41 -1.97 8.68 1.01 0.964 0.203
Discuss the merits of the two forecasting methods for this dataset.
ETS(A,N,N) is a simpler model with fewer parameters. It is less prone to overfitting and would work well if exports fluctuated around a relatively stable mean without a strong upward or downward trend. Since our data show a strong upward trend, it underperforms.
ETS(A,A,N) captures a linear trend, making it more suitable for data with sustained growth or decline, and it provides a more dynamic forecast. It usually yields a lower RMSE than ETS(A,N,N) when a trend is present. One drawback is that if the trend is temporary, or changes direction, the model may overestimate future values.
#compare forecasts
combined_forecast <- bind_rows(
fc_ann |>
mutate(model = 'ETS(A,N,N)'),
fc_aan |>
mutate(model = 'ETS(A,A,N)')
)
autoplot(mex_economy, Exports) +
autolayer(combined_forecast, .mean)
Personally, I’d argue that the ETS(A,A,N) model is better.
We know that it works better when the data show a strong trend, which ours does (upwards).
It also has a lower RMSE during training, and it is probably better for long-term forecasting, since ETS(A,N,N) assumes no trend while our data clearly have one.
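As an extra check (not part of the original question), both specifications can be fitted in a single model() call and compared on information criteria, where a lower AICc indicates a better trade-off between fit and complexity:
#fit both ETS specifications together and compare information criteria
mex_economy |>
  model(
    ANN = ETS(Exports ~ error('A') + trend('N') + season('N')),
    AAN = ETS(Exports ~ error('A') + trend('A') + season('N'))
  ) |>
  glance() |>
  select(.model, AICc, BIC)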
#extract first forecast for each model
y_hat_ann <- fc_ann$.mean[1]
y_hat_aan <- fc_aan$.mean[1]
#approximate 95% prediction interval: point forecast ± 1.96 × σ,
#using the training RMSE as an estimate of σ
z_score <- 1.96 # for 95% coverage
#RMSE for each model
accuracy_ann <- accuracy(ets_ann) |>
filter(.type == 'Training') |>
pull(RMSE)
accuracy_aan <- accuracy(ets_aan) |>
filter(.type == 'Training') |>
pull(RMSE)
ci_ann <- c(y_hat_ann - z_score * accuracy_ann, y_hat_ann + z_score * accuracy_ann)
ci_aan <- c(y_hat_aan - z_score * accuracy_aan, y_hat_aan + z_score * accuracy_aan)
ci_ann
## [1] 33.64367 42.08901
ci_aan
## [1] 34.27800 42.48647
ANN 95% prediction interval (manual): 33.64367 to 42.08901
AAN 95% prediction interval (manual): 34.27800 to 42.48647
#prediction intervals computed by R via hilo()
hilo(fc_ann, level = 95)[[1, 6]]
## <hilo[1]>
## [1] [33.569, 42.16368]95
hilo(fc_aan, level = 95)[[1, 6]]
## <hilo[1]>
## [1] [34.12878, 42.63569]95
As we can see, the manually computed interval for ETS(A,N,N) is very close to the one produced by R, and the same holds for ETS(A,A,N); the small differences come from using the training RMSE as an approximation of the forecast standard deviation.
#AR(1) model with φ₁ = 0.6 and σ² = 1
set.seed(42)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
#time plot for the series
ggplot(sim, aes(x = idx, y = y)) +
geom_line()
As we increase φ₁, the series shows stronger persistence: successive values are more alike, so the plot looks smoother, with longer swings away from the mean.
As we decrease φ₁ towards zero, the series behaves more like white noise and oscillates more rapidly around the mean.
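A quick way to see this is to wrap the simulation in a small helper function and plot a few φ₁ values side by side (a sketch; the function name and the chosen φ₁ values are arbitrary):
#helper: simulate an AR(1) series y_t = φ₁*y_{t-1} + e_t for a given φ₁ (σ² = 1)
sim_ar1 <- function(phi, n = 100, seed = 42) {
  set.seed(seed)
  e <- rnorm(n)
  y <- numeric(n)
  for (i in 2:n) y[i] <- phi * y[i - 1] + e[i]
  tibble(idx = seq_len(n), phi = phi, y = y)
}
#compare a weak, moderate and strong φ₁ on the same scale
bind_rows(sim_ar1(0.1), sim_ar1(0.6), sim_ar1(0.95)) |>
  ggplot(aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~phi, ncol = 1, labeller = label_both)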
#MA(1) model with θ₁ = 0.6 and σ² = 1
set.seed(42)
y_ma1 <- numeric(100)
e_ma1 <- rnorm(100)
for(i in 2:100) {
y_ma1[i] <- e_ma1[i] + 0.6 * e_ma1[i-1]
}
sim_ma1 <- tsibble(idx = seq_len(100), y = y_ma1, index = idx)
#time plot for the series
ggplot(sim_ma1, aes(x = idx, y = y)) +
geom_line()
Smaller θ₁ values make the series look more like random scatter, while a larger θ₁ makes it smoother. Being smoother also means that values are more influenced by the previous noise term.
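The same kind of comparison can be done for the MA(1) parameter (again just a sketch, with arbitrary θ₁ values and helper name):
#helper: simulate an MA(1) series y_t = e_t + θ₁*e_{t-1} for a given θ₁ (σ² = 1)
sim_ma1_theta <- function(theta, n = 100, seed = 42) {
  set.seed(seed)
  e <- rnorm(n)
  y <- numeric(n)
  for (i in 2:n) y[i] <- e[i] + theta * e[i - 1]
  tibble(idx = seq_len(n), theta = theta, y = y)
}
bind_rows(sim_ma1_theta(0.1), sim_ma1_theta(0.6), sim_ma1_theta(0.9)) |>
  ggplot(aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~theta, ncol = 1, labeller = label_both)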
set.seed(42)
#ARMA(1, 1) model
y_arma11 <- numeric(100)
e_arma11 <- rnorm(100)
for(i in 2:100) {
y_arma11[i] <- 0.6 * y_arma11[i-1] + e_arma11[i] + 0.6 * e_arma11[i-1]
}
sim_arma11 <- tsibble(idx = seq_len(100), y = y_arma11, index = idx)
set.seed(42)
#AR(2) model
y_ar2 <- numeric(100)
e_ar2 <- rnorm(100)
for(i in 3:100) {
y_ar2[i] <- -0.8 * y_ar2[i-1] + 0.3 * y_ar2[i-2] + e_ar2[i]
}
sim_ar2 <- tsibble(idx = seq_len(100), y = y_ar2, index = idx)
#plot the series
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#ARMA(1,1) plot
p1 <- ggplot(sim_arma11, aes(x = idx, y = y)) +
geom_line()
#AR(2) plot
p2 <- ggplot(sim_ar2, aes(x = idx, y = y)) +
geom_line()
grid.arrange(p1, p2, nrow = 2)
The ARMA(1,1) plot shows behavior similar to our earlier plots. The AR(2) plot, on the other hand, does not satisfy the stationarity conditions and exhibits explosive behavior: the oscillations start small and grow over time like an expanding sound wave. The combination of φ₁ and φ₂ causes the variance of the oscillations to grow, so the values deviate more and more from the mean and the series looks increasingly unstable.
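This can be confirmed numerically: an AR(2) process is stationary only if all roots of 1 − φ₁B − φ₂B² lie outside the unit circle. A quick check with base R’s polyroot() (a supplementary verification, not part of the original exercise):
#roots of 1 - φ₁B - φ₂B² with φ₁ = -0.8 and φ₂ = 0.3
#stationarity requires every root to have modulus greater than 1
Mod(polyroot(c(1, 0.8, -0.3)))
One of the roots has modulus of roughly 0.93 < 1, so the process is indeed non-stationary; equivalently, the stationarity condition φ₂ − φ₁ < 1 is violated, since 0.3 − (−0.8) = 1.1.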