HW 2

library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9

  1. Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
global_economy_per_capita <- global_economy |>
  mutate(gdp_per_capita = GDP/Population)

global_economy_per_capita |>
  autoplot(gdp_per_capita, show.legend=FALSE) + # legend is way too large to include
  geom_point(show.legend=FALSE) +
  labs(title= "GDP per capita", y = "$US")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 3325 rows containing missing values or values outside the scale range
## (`geom_point()`).

The legend is too large to fit on the graph, but around 2000 we can easily see one country with a strong lead, so let’s filter to that year and sort in descending order:

# In 2000:
global_economy_top_2000 <- global_economy_per_capita |>
  filter(Year == 2000) |>
  arrange(desc(gdp_per_capita))
global_economy_top_2000

So the country with the highest GDP per capita is Monaco. Monaco’s GDP per capita has generally increased over time, with some variation: there are two larger troughs around 1984 and 2000, and the highest peak is in 2014. There have been a few years where another country took the lead, but for the most part Monaco has had the highest. In 1976, the United Arab Emirates was highest, but it had lost the lead by 1978.

# In 1976:
global_economy_top_1976 <- global_economy_per_capita |>
  filter(Year == 1976) |>
  arrange(desc(gdp_per_capita))
global_economy_top_1976

In more recent years, Liechtenstein has gotten closer to Monaco, actually surpassing it in 2013 and 2015.
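
To check those claims, here is a quick sketch (not part of the original exercise) that plots only the leading countries, using the names from the sorted tables above, so a legend fits comfortably:

# Plot just the leading countries identified above
global_economy_per_capita |>
  filter(Country %in% c("Monaco", "Liechtenstein", "United Arab Emirates")) |>
  autoplot(gdp_per_capita) +
  labs(title = "GDP per capita: leading countries", y = "$US")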

  2. For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy:

Original data:

global_economy_us <- global_economy |>
  filter(Country == "United States")

global_economy_us |>
  autoplot(GDP) + 
  geom_point() +
  labs(title= "GDP", y = "$US")

Let’s make a population adjustment to make it per-capita:

global_economy_us_per_capita <- global_economy |>
  filter(Country == "United States") |>
  mutate(gdp_per_capita = GDP/Population) |>
  select(gdp_per_capita)

global_economy_us_per_capita |>
  autoplot(gdp_per_capita) + 
  geom_point() +
  labs(title= "GDP per capita", y = "$US")

Slaughter of Victorian bulls, bullocks and steers from aus_livestock:

Original data:

# original data
aus_livestock_victorian_bulls_bullocks_steers <- aus_livestock |>
  filter(State == "Victoria", Animal == "Bulls, bullocks and steers") |>
  select(Count)

aus_livestock_victorian_bulls_bullocks_steers |>
  autoplot(Count)

The size of the variation doesn’t change much as time increases, but let’s see what a Box-Cox transformation does:

# guerrero box cox
lambda_aus_livestock <- aus_livestock_victorian_bulls_bullocks_steers |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)
lambda_aus_livestock
## [1] -0.04461887
aus_livestock_victorian_bulls_bullocks_steers |>
  autoplot(box_cox(Count, lambda_aus_livestock))

This did not do much. The only other adjustment I would consider is a calendar adjustment: converting the monthly counts to average counts per day, which would smooth the series a bit, since months with more days naturally have higher totals. I tried to use monthdays() to compute the daily average but wasn’t able to get it to work on the tsibble; a possible alternative is sketched below.
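
A minimal sketch of that calendar adjustment, assuming lubridate’s days_in_month() and that the yearmonth index converts cleanly with as.Date() (forecast::monthdays() expects a ts object, which is why it doesn’t work directly on a tsibble):

# Average count per day, removing the varying-month-length effect
aus_livestock_victorian_bulls_bullocks_steers |>
  mutate(daily_average = Count / days_in_month(as.Date(Month))) |>
  autoplot(daily_average) +
  labs(title = "Average count per day", y = "Count per day")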

Victorian electricity demand from vic_elec:

Original data:

vic_elec |>
  autoplot(Demand)

This graph is very hard to read because of the sheer volume of half-hourly data, but for the most part the variation stays fairly consistent, with a few larger peaks around the end/start of each year.

Here is a Box-Cox transformation in case it makes the graph any more even:

lambda_elec <- vic_elec |>
  features(Demand, features = guerrero) |>
  pull(lambda_guerrero)
lambda_elec
## [1] 0.09993089
vic_elec |>
  autoplot(box_cox(Demand, lambda_elec))

As expected, the Box-Cox transformation didn’t do much.

Let’s try changing the time interval of the data to make the graph easier to read. Below, the data has been aggregated to daily totals:

vic_elec_daily <- vic_elec |>
  index_by(Day = as_date(Time)) |>
  summarise(
    Demand = sum(Demand)
  )

vic_elec_daily |>
  autoplot(Demand)

For the most part the variance stays pretty consistent in this graph as well.

Let’s see if a Box-Cox transformation helps at all:

lambda_elec_daily <- vic_elec_daily |>
  features(Demand, features = guerrero) |>
  pull(lambda_guerrero)
lambda_elec_daily
## [1] -0.8999268
vic_elec_daily |>
  autoplot(box_cox(Demand, lambda_elec_daily))

The Box-Cox transformation didn’t do much for this series either, so no transformation is needed.

Gas production from aus_production:

Original data:

aus_production |>
  autoplot(Gas)

There is a lot of variation in this graph - after 1970, the size of the peaks and troughs grows increasingly larger.

Let’s try a box-cox transformation to even out the graph:

lambda_gas <- aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)
lambda_gas
## [1] 0.1095171
aus_production |>
  autoplot(box_cox(Gas, lambda_gas))

The variation is a lot more even now, making the graph easier to read. Using this lambda makes the size of the seasonal variation the same across the whole series.
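
As a quick aside (not required by the exercise), the Box-Cox transformation for a non-zero lambda is (y^lambda - 1) / lambda, and log(y) when lambda is zero. A small sanity check against box_cox(), assuming all Gas values are positive (which they are here):

# Manual Box-Cox vs box_cox() -- these should agree for positive values
aus_production |>
  mutate(
    gas_bc_manual = (Gas^lambda_gas - 1) / lambda_gas,
    gas_bc        = box_cox(Gas, lambda_gas)
  ) |>
  as_tibble() |>
  summarise(max_diff = max(abs(gas_bc_manual - gas_bc), na.rm = TRUE))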

  3. Why is a Box-Cox transformation unhelpful for the canadian_gas data?
# original data:
canadian_gas |>
  autoplot(Volume)

# box cox transformation
canadian_gas_lambda_guerrero <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)
canadian_gas_lambda_guerrero
## [1] 0.5767648
canadian_gas |>
  autoplot(box_cox(Volume, canadian_gas_lambda_guerrero))

The Box-Cox transformation produces a graph that is nearly identical to the original, so it is not very useful here. The reason is that the seasonal variation in canadian_gas is not proportional to the level of the series: it is small in the early years, largest through roughly the 1970s and 1980s, and then shrinks again, so no single monotonic transformation can make the variation even across the whole series.
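
One way to see this (a rough sketch, not required by the exercise) is to look at the seasonal component from an STL decomposition; its amplitude grows and then shrinks rather than tracking the level of the series:

# STL decomposition of canadian_gas -- watch the changing seasonal amplitude
canadian_gas |>
  model(STL(Volume)) |>
  components() |>
  autoplot()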

  4. What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

Original data:

set.seed(100)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |>
  autoplot(Turnover)

To stabilise the variance, I would select a Box-Cox transformation with the lambda chosen by the Guerrero method (about 0.16, fairly close to a log transformation):

lambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)
lambda
## [1] 0.1571524
myseries |>
  autoplot(box_cox(Turnover, lambda))
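
Since the chosen lambda is fairly close to zero, a simple log transformation gives a similar picture (a quick comparison, not part of the exercise):

# Log transformation for comparison with the Box-Cox plot above
myseries |>
  autoplot(log(Turnover)) +
  labs(title = "Log-transformed turnover")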

  5. For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

Tobacco from aus_production:

Original data:

aus_production |>
  autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

To improve:

lambda_tobacco <- aus_production |>
  features(Tobacco, features = guerrero) |>
  pull(lambda_guerrero)
lambda_tobacco
## [1] 0.9264636
aus_production |>
  autoplot(box_cox(Tobacco, lambda_tobacco))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

Lambda is close to one, so no transformation is needed.
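
With lambda exactly equal to one, the Box-Cox transformation is just y - 1, a shift that doesn’t change the shape of the series at all. A quick check on some dummy values (an illustration, not part of the exercise):

# box_cox() with lambda = 1 only subtracts 1 from each value
x <- c(10, 50, 200)
all.equal(box_cox(x, 1), x - 1)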

Economy class passengers between Melbourne and Sydney from ansett:

Original data:

ansett_econ_btw_melb_syd <- ansett |>
  filter(Class == "Economy", Airports == "MEL-SYD") |>
  select(Passengers)

ansett_econ_btw_melb_syd |>
  autoplot(Passengers)

To improve:

lambda_passengers <- ansett_econ_btw_melb_syd |>
  features(Passengers, features = guerrero) |>
  pull(lambda_guerrero)
lambda_passengers
## [1] 1.999927
ansett_econ_btw_melb_syd |>
  autoplot(box_cox(Passengers, lambda_passengers))

This made the troughs around 1987 and 1992 look larger, but overall the variance looks slightly more stable now.

Pedestrian counts at Southern Cross Station from pedestrian:

Original data:

pedestrian_southern_cross_station <- pedestrian |>
  filter(Sensor == "Southern Cross Station")

pedestrian_southern_cross_station |>
  autoplot(Count)

To improve:

lambda_pedestrians <- pedestrian_southern_cross_station |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)
lambda_pedestrians
## [1] -0.2501616
pedestrian_southern_cross_station |>
  autoplot(box_cox(Count, lambda_pedestrians))

This graph does look more uniform. Out of curiosity, let’s try a daily time interval:

pedestrian_southern_cross_station_daily <- pedestrian_southern_cross_station |>
  group_by_key() |>
  index_by(Date) |> # group by Date and use it as new index
  summarise(Count = sum(Count))
pedestrian_southern_cross_station_daily
pedestrian_southern_cross_station_daily |> 
  autoplot(Count)

lambda_pedestrians_daily <- pedestrian_southern_cross_station_daily |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)
lambda_pedestrians_daily
## [1] 0.2726316
pedestrian_southern_cross_station_daily |>
  autoplot(box_cox(Count, lambda_pedestrians_daily))

This didn’t do all that much, but the variance does look more stable than in the original graph.

  6. Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) |> select(Gas)
  a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
gas |>
  autoplot(Gas) +
  geom_point()

There is a clear seasonal pattern: there is a trough at Q1, the series then increases in Q2, peaks at Q3, and drops at Q4 to a value that is always higher than Q1 but lower than Q2 and Q3. The overall trend is increasing.

This makes sense since Q3 is winter in Australia, when demand for gas for heating is at its highest.
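
A seasonal plot makes this quarterly pattern easier to see (a quick sketch using gg_season() from feasts, not required by the exercise):

# One line per year; the Q3 peak and Q1 trough show up clearly
gas |>
  gg_season(Gas, labels = "right") +
  labs(title = "Seasonal plot: gas production (last five years)", y = "Petajoules")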

  b. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) |>
  components() |>
  autoplot()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

  c. Do the results support the graphical interpretation from part a?

Yes, the seasonal indices show that there is a trough at Q1 and a peak during the winter at Q3.
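
To read off the actual seasonal indices rather than eyeballing the plot, here is a small sketch (assuming the components table uses the column name seasonal, as classical_decomposition does):

# First year of seasonal indices from the multiplicative decomposition
gas |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components() |>
  as_tibble() |>
  select(Quarter, seasonal) |>
  head(4)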

  d. Compute and plot the seasonally adjusted data.
dcmp <- gas |>
  model(stl = STL(Gas))
components(dcmp)
gas |>
  autoplot(Gas, color = "gray") +
  autolayer(components(dcmp), season_adjust, color = "#0072B2")

  e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
gas_tibble <- gas |>
  as_tibble()
gas_tibble$Gas[1] <- gas_tibble$Gas[1] + 300

gas_updated <- gas_tibble |>
  as_tsibble(index = Quarter)
dcmp_updated <- gas_updated |>
  model(stl = STL(Gas))
components(dcmp_updated)
gas_updated |>
  autoplot(Gas, color = "gray") +
  autolayer(components(dcmp_updated), season_adjust, color = "#0072B2")

This graph is much less smooth than the original one. The outlier mostly ends up in the remainder component, so it stays in the seasonally adjusted series, producing a large spike at the first observation followed by a big drop, and more variation overall. This graph gives us much less insight than the original one.
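
Plotting the full set of STL components confirms where the outlier ends up (a quick check, not part of the exercise); most of the added 300 lands in the remainder:

# Decomposition of the series with the outlier in the first observation
components(dcmp_updated) |>
  autoplot()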

  f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Middle outlier:

# Choose middle outlier this time
gas_tibble_middle_outlier <- gas |>
  as_tibble()
gas_tibble_middle_outlier$Gas[10] <- gas_tibble_middle_outlier$Gas[10] + 300

gas_updated_middle_outlier <- gas_tibble_middle_outlier |>
  as_tsibble(index = Quarter)
dcmp_updated_middle_outlier <- gas_updated_middle_outlier |>
  model(stl = STL(Gas))
components(dcmp_updated_middle_outlier)
gas_updated_middle_outlier |>
  autoplot(Gas, color = "gray") +
  autolayer(components(dcmp_updated_middle_outlier), season_adjust, color = "#0072B2")

End outlier:

# Choose end outlier this time
gas_tibble_end_outlier <- gas |>
  as_tibble()
gas_tibble_end_outlier$Gas[20] <- gas_tibble_end_outlier$Gas[20] + 300

gas_updated_end_outlier <- gas_tibble_end_outlier |>
  as_tsibble(index = Quarter)
dcmp_updated_end_outlier <- gas_updated_end_outlier |>
  model(stl = STL(Gas))
components(dcmp_updated_end_outlier)
gas_updated_end_outlier |>
  autoplot(Gas, color = "gray") +
  autolayer(components(dcmp_updated_end_outlier), season_adjust, color = "#0072B2")

The position of the outlier does not seem to make a big difference here. Both graphs show a lot of variation, so it would make sense to deal with outliers before doing the seasonal adjustment in order to get more accurate results.
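
For a rough side-by-side check (an extra sketch, not part of the exercise), the remainder components of the two decompositions can be overlaid; the spike simply moves with the outlier:

# Compare remainders for the middle vs end outlier
bind_rows(
  middle = as_tibble(components(dcmp_updated_middle_outlier)),
  end    = as_tibble(components(dcmp_updated_end_outlier)),
  .id = "outlier_position"
) |>
  ggplot(aes(x = as.Date(Quarter), y = remainder, colour = outlier_position)) +
  geom_line() +
  labs(x = "Quarter", y = "Remainder")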

  7. Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

Original data:

set.seed(100)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |>
  autoplot(Turnover)

X-11 decomposition:

library(seasonal)
## 
## Attaching package: 'seasonal'
## The following object is masked from 'package:tibble':
## 
##     view
x11_dcmp <- myseries |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components()
autoplot(x11_dcmp)

The irregular component shows a spike around 1986, a smaller one around 1990, and another around 2009.
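
To pin down the dates of the largest irregular values, here is a sketch that assumes the default multiplicative X-11 decomposition (so the irregular component is centred on 1):

# Months where the irregular component deviates most from 1
x11_dcmp |>
  as_tibble() |>
  arrange(desc(abs(irregular - 1))) |>
  select(Month, irregular) |>
  head(5)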

  8. Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
  a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The general trend is increasing, and the seasonal component shows strong, regular seasonality. The seasonal panel has the largest grey scale bar, indicating that its variation is the smallest relative to the original data. The remainder has a lot more variation, which makes sense; it also shows the big drop in 1991. The remainder panel has the second-largest grey bar, meaning the remainder varies more than the seasonal component.

  b. Is the recession of 1991/1992 visible in the estimated components?

Yes, you can see in the remainder panel that something happened in 1991, causing a major drop in the labour force. In Figure 3.20, the March panel shows a drop around 1991, suggesting the effect was concentrated there.