Data 624 Homework 2

library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)

3.1 Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

max_gdp = global_economy |>
  filter(Year == 2017) |>
  slice_max(order_by = GDP / Population, n = 3)

global_economy |>
  autoplot(GDP / Population) +
  theme_bw() +
  theme(legend.position = "none") +
  geom_label_repel(data = max_gdp, aes(label = Country), min.segment.length = 0, 
                   direction = "y", nudge_x = 10) +
  labs(x = "Year", y = "GDP per Capita ($USD)", title = "GDP per Capita Since 1960")

## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).

Based on the plot above, the top country in terms of GPD per Capita in 2017 was Luxembourg. However, these labels don’t include the 2 highest countries from previous years. As a result, the data was looked at further to find the maximum instances, regardless of year.

global_economy |>
  mutate(GDP_per_Capita = GDP / Population) |>
  arrange(desc(GDP_per_Capita))

## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `Country`, `Year` first.
## Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `Country`, `Year` first.

Looking at the above tsibble, 2 of the top 3 years overall in terms of GDP per Capita occurred in 2014, one by Monaco and one by Liechtenstein. The reason neither of these countries were included in the original plot, given that they were clearly the top 2 countries in terms of GDP per Capita over the previous 20+ years, is that both Monaco and Liechtenstein have an NA value for the GDP, as seen in the tsibble below. Thus, the GDP per Capita could not be calculated for these 2 countries in 2017.

global_economy |>
  filter(Year == 2017, Country %in% c("Monaco", "Liechtenstein"))

In general, the GDP per Capita has increased over time for most countries. Additionally, the spread between the highest and lowest countries has increased over time. Finally, based on the tsibble below, the country with the highest GDP per Capita was the United States in 1960, whereas in 2017, as previously discussed, it was Luxembourg.

global_economy |>
  filter(Year == 1960) |>
  mutate(GDP_per_Capita = GDP / Population) |>
  arrange(desc(GDP_per_Capita))

3.2 For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Victorian Electricity Demand from vic_elec.
Gas production from aus_production.

global_economy |>
  filter(Country == "United States") |>
  autoplot(GDP) +
  labs(x = "Year", y = "GDP ($USD)", title = "GDP in US Since 1960") +
  theme_bw()

global_economy |>
  filter(Country == "United States") |>
  autoplot(GDP / Population) +
  labs(x = "Year", y = "GDP per Capita ($USD)", title = "GDP per Capita in US Since 1960") +
  theme_bw()

The first plot above shows the US ADP since 1960, whereas the second plot shows the US GDP per Capita since 1960, where the GDP was divided by the US population. Based on the two plots above, the general trend does not change. That is, they both have a positive trend, with a slight dip around 2008, followed by the continued positive trend.

aus_livestock |>
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria") |>
  autoplot(Count / 1000) +
  labs(x = "Month", y = "Number of Animals ('000s)",
       title = "Number of Bulls, Bullocks, and Steers Slaugtered in Victoria since 1976") +
  theme_bw()

The above plot shows a general decreasing trend in total number of bulls, bullocks, and steers slaugtered in Victoria since 1976. The total count was divided by 1000, simply for readability on the plot. This only affects the scale of the plot, not the general trends or seasonality.

vic_elec |>
  autoplot(Demand) +
  labs(y = "Demand (MWh)",
       title = "Half-Hourly Electricity Demand in Victoria since 2012") +
  theme_bw()

elec_lambda = vic_elec |>
  features(Demand, features = guerrero) |>
  pull(lambda_guerrero)

vic_elec |>
  autoplot(box_cox(Demand, elec_lambda)) +
  labs(y = "",title = latex2exp::TeX(paste0(
         "Transformed Half-Hourly Electricity Demand in Victoria with $\\lambda$ = ",
         round(elec_lambda,2)))) +
  theme_bw()

The first plot above shows the half-hourly electricity demand for Victoria since 2012. The second plot above shows the same data, but transformed using a Box-Cox transformation and a \(\lambda\) value of 0.1. Comparing the 2 plots, the higher peaks in the first plot are reduced to be more in line with the other peaks. Otherwise, the rest of the transformed plot is similar to the original plot.

aus_production |>
  autoplot(Gas) +
  labs(y = "Production (petajoules)",
       title = "Quarterly Gas Production in Australia since 1956") +
  theme_bw()

gas_lambda = aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)

aus_production |>
  autoplot(box_cox(Gas, gas_lambda)) +
  labs(y = "", title = latex2exp::TeX(paste0(
         "Transformed Quarterly Gas Production in Australia since 1956 using $\\lambda$ = ",
         round(gas_lambda,2)))) +
  theme_bw()

The first plot above shows the quarterly gas production in Australia since 1956. The second plot above shows the same data, but transformed using a Box-Cox transformation and a \(\lambda\) value of 0.11. Comparing the 2 plots, the higher peaks and lower troughs in the first plot are brought more in line with one another. For example, in the original plot, the data before 1970 has very small peaks, and the data after 1990 has larger peaks. In the transformed plot, the peaks throughout all of the data are about the same.

3.3 Why is a Box-Cox transformation unhelpful for the canadian_gas data?

canadian_gas |>
  autoplot() +
  labs(y = "Production (billion cubic metres)",
       title = "Monthly Gas Production in Canada sinc 1960") +
  theme_bw()

## Plot variable not specified, automatically selected `.vars = Volume`

can_gas_lambda = canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

canadian_gas |>
  autoplot(box_cox(Volume, can_gas_lambda)) +
  labs(y = "", title = latex2exp::TeX(paste0(
         "Transformed Monthly Gas Production in Canada since 1960 using $\\lambda$ = ",
         round(can_gas_lambda,2)))) +
  theme_bw()

Above are 2 plots of the monthly Canadian gas production. The first shows the data as is, whereas the second plot shows the data that’s been transformed using a Box-Cox transformation. Based on the two plots above, the Box-Cox transformation did not modify seasonality enough to bring the peaks in line with one another. As a result, a Box-Cox transformation is not useful for the canadian_gas data.

3.4 What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(9304)
aus_retail_sample <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

series_id = aus_retail_sample$`Series ID`[1]

aus_retail_sample |>
  autoplot(Turnover) +
  labs(y = "Turnover ($Million AUD)", title = latex2exp::TeX(paste0(
         "Monthly Retail Turnover for Series ID ", series_id))) +
  theme_bw()

retail_lambda = aus_retail_sample |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

aus_retail_sample |>
  autoplot(box_cox(Turnover, retail_lambda)) +
  labs(y = "", title = latex2exp::TeX(paste0(
         "Transformed Monthly Retail Turnover for Series ID ", series_id,
         " using $\\lambda$ = ", round(retail_lambda,2)))) +
  theme_bw()

Based on the analysis above using the guerrero feature, a \(\lambda\) value of -0.03. The above 2 plots show the differences between the non-transformed and transformed data. The second plot with the transformed data shows that the variation in seasonality has been minimized over time.

3.5 For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

tobacco_lambda = aus_production |>
  features(Tobacco, features = guerrero) |>
  pull(lambda_guerrero)

tobacco_lambda

## [1] 0.9264636

For the Tobacco series from aus_production, using the guerrero method, the lambda value for the Box-Cox transformation is about 0.93.

economy_lambda = ansett |>
  filter(Airports == "MEL-SYD", Class == "Economy") |>
  features(Passengers, features = guerrero) |>
  pull(lambda_guerrero)

economy_lambda

## [1] 1.999927

For the Economy series from ansett, filtered for flights between Melbourne and Sydney, using the guerrero method, the lambda value for the Box-Cox transformation is about 2.

pedestrian_lambda = pedestrian |>
  filter(Sensor == "Southern Cross Station") |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)

pedestrian_lambda

## [1] -0.2501616

For the Pedestrian series from pedestrian, filtered by Southern Cross Station, using the guerrero method, the lambda value for the Box-Cox transformation is about -0.25.

3.7 Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)

Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

gas |>
  autoplot() +
  labs(x = "Quarter", y = "Production (petajoules",
       title = "Quarterly Gas Production since Q3 2005") +
  theme_bw()

## Plot variable not specified, automatically selected `.vars = Gas`

Based on the above plot, there is seasonality throughout the year, with a peak in Q3, and a trough in Q1. There is also a positive trend over the past 5 years.

Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components() |>
  autoplot() +
  labs(title = "Classical Decomposition of Gas Production") +
  theme_bw()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

The second plot above shows the calculated trend-cycle, whereas the third plot above shows the seasonal component, which includes the seasonal indices.

Do the results support the graphical interpretation from part a?

Yes, the seasonal component in the plot above shows a low in Q1, followed by a peak in Q3, which is consistent over time.

Compute and plot the seasonally adjusted data.

gas |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components() |>
  as_tsibble() |>
  autoplot(Gas, color = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(x = "Quarter", y = "Production (petajoules)",
       title = "Seasonally Adjusted Gas Production") +
  theme_bw()

Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

gas_outlier = gas |>
  mutate(Outlier = case_when(
    Quarter == yearquarter("2008 Q2") ~ Gas + 300,
    .default = Gas)
    )

gas_outlier |>
  autoplot(Outlier) +
  labs(x = "Quarter", y = "Production (petajoules",
       title = "Quarterly Gas Production since Q3 2005 with Outlier in 2008 Q2") +
  theme_bw()

gas_outlier |>
  model(classical_decomposition(Outlier, type = "multiplicative")) |>
  components() |>
  as_tsibble() |>
  autoplot(Outlier, color = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(x = "Quarter", y = "Production (petajoules)",
       title = "Seasonally Adjusted Gas Production with Outlier in 2008 Q2") +
  theme_bw()

When adding an outlier into the dataset, not only is there a large spike in the seasonally adjusted plot, but the remainder of the seasonally adjusted plot is not as smooth as the original data. Additionally, there is not a clear positive trend like there was in the original plot.

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas_outlier = gas |>
  mutate(Outlier = case_when(
    Quarter == yearquarter("2010 Q1") ~ Gas + 300,
    .default = Gas)
    )

gas_outlier |>
  model(classical_decomposition(Outlier, type = "multiplicative")) |>
  components() |>
  as_tsibble() |>
  autoplot(Outlier, color = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(x = "Quarter", y = "Production (petajoules)",
       title = "Seasonally Adjusted Gas Production with Outlier in 2010 Q1") +
  theme_bw()

Looking at the above plot with the outlier towards the end, it shows that there is a difference in the seasonally adjusted plot. That is, there is still a large peak near the outlier, but the remainder of the plot is smoother than the plot with the outlier in the middle. Additionally, the plot above with the outlier at the end shows more of a positive trend than the plot with the outlier in the middle.

3.8 Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

aus_retail_sample |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components() |>
  autoplot() +
  labs(title = latex2exp::TeX(paste0(
         "X11 Decomposed Retail Data for Series ID ", series_id))) +
  theme_bw()

Looking at the X-11 decomposition above, it appears there are some outliers, based on the irregular portion of the plot. That is, there are high peaks some time in 1994, 2000-2001, and then a low peak in 2002-2003. These were not evident in the original plots generated.

3.9 Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

Looking at the trend portion of the STL decomposition above, there is a clear positive trend. Based on the scales of both the original plot and the trend plot, this increases from about 6500 to about 9000 from 1978 to 1995. The season_year portion of the STL decomposition above shows there is annual seasonality within the data, with the lowest civilian labor numbers being in January and July, and the highest being in March and December. Based on the scale of the seasonality plot, this accounts for +/- 100 civilians from the general trend. The remainder portion of the STL decomposition shows that the remainder of the variation throughout the data accounted for roughly +/- 100 civilians after accounting for both the trend and seasonality. The exception to this is in approximately 1991-1992, where the remainder accounts for a drop in around 400 civilians.

Is the recession of 1991/1992 visible in the estimated components?

Yes, the 1991/1992 recession is visible in the remainder portion of the STC decomposition, where there was a large drop in civilian labor numbers of close to 400 civilians. In all other years, the most the civilian labor numbers dropped–after accounting for both the trend and seasonality–was by about 125 civilians.

Data 624 Homework 2

Michael Drake

2026-02-011