Homework02Hyndman

Introduction

The following RMD contains answers to exercises in chapter 3 of the Forecasting: Principles and Practice textbook for CUNY SPS DATA 624 Spring 2025 https://otexts.com/fpp3/decomposition-exercises.html. This chapter focuses on Time Series Decomposition, its applications, uses, and methods. The 3.7 Exercises answered here include 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9.

Load Libraries

library(fpp3)
library(tidyverse)
library(seasonal)

Question 1

Prompt: Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

# Calculate GDP per capita and remove missing values
global_gdp <- global_economy |> 
  select(Country, Year, GDP, Population) |> 
  mutate(GDP_per_capita = GDP / Population) |> 
  drop_na(GDP_per_capita)

# Plot GDP per capita over time for each country
ggplot(global_gdp, aes(x = Year, y = GDP_per_capita, color = Country)) +
  geom_line() +
  labs(title = "GDP per Capita Over Time",
       x = "Year",
       y = "GDP per Capita") + 
  scale_y_continuous(labels = scales::label_dollar()) +
  theme_minimal() +
  theme(legend.position = "none")

# Find the country with the highest GDP per capita each year
highest_gdp_per_capita <- global_gdp |> 
  index_by(Year) |> 
  slice_max(order_by = GDP_per_capita, n = 1) |> 
  ungroup()

# Display the top GDP per capita countries over time
print(highest_gdp_per_capita)

## # A tsibble: 58 x 5 [1Y]
## # Key:       Country [263]
##    Country        Year     GDP Population GDP_per_capita
##    <fct>         <dbl>   <dbl>      <dbl>          <dbl>
##  1 United States  1960 5.43e11  180671000          3007.
##  2 United States  1961 5.63e11  183691000          3067.
##  3 United States  1962 6.05e11  186538000          3244.
##  4 United States  1963 6.39e11  189242000          3375.
##  5 United States  1964 6.86e11  191889000          3574.
##  6 Kuwait         1965 2.10e 9     473554          4429.
##  7 Kuwait         1966 2.39e 9     524856          4556.
##  8 United States  1967 8.62e11  198712000          4336.
##  9 United States  1968 9.42e11  200706000          4696.
## 10 United States  1969 1.02e12  202677000          5032.
## # ℹ 48 more rows

# Visualize the top GDP per capita countries over time
ggplot(highest_gdp_per_capita, aes(x = Year, y = GDP_per_capita, fill = Country)) +
  geom_col() +  
  labs(title = "Highest GDP per Capita by Year",
       x = "Year",
       y = "GDP per Capita",
       fill = "Country") +
  scale_y_continuous(labels = scales::label_dollar()) +
  scale_fill_brewer(palette = "Set2") + 
  theme_minimal()

The country that has the highest GDP per capita is most recently Luxemborg. Across time in this data, the highest GDP per capita has been the United States, Kuwait, Monaco, the UAE, Liechtenstein, and Luxemborg. Monaco held the title most often, with a total of 43 years. The highest GDP per capita seems to indicate global cycles with dips around 1995 and 2000. Other than these dips, the data consists of an upward trend in global maximum GDP per capita.

It is worth noting that there are some rows in the data that contain non-country values such as “Middle income.” They were not discluded from the data because they did not affect the final Highest GDP per Capita by Year graph and calculation.

Question 2

Prompt: For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock
Victorian Electricity Demand from vic_elec
Gas production from aus_production

United States GDP from global_economy

# Load United States GDP
us_gdp <- global_economy |> 
  filter(Country == "United States")|> 
  select(Country, Year, GDP)

# Plot Year and GDP as a line graph
ggplot(us_gdp, aes(x = Year, y = GDP)) +
  geom_line() +
  scale_y_continuous("Total GDP",
                     breaks = scales::breaks_extended(8),
                     labels = scales::label_currency()
  ) +
  theme_minimal()

Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock

# Load Slaughter of Victorian “Bulls, bullocks and steers”
slaughter_bbs <- aus_livestock |> 
  filter(Animal == "Bulls, bullocks and steers")|> 
  select(Month, Animal, Count)

# Plot Month and Count as a line graph
ggplot(slaughter_bbs, aes(x = Month, y = Count)) +
  geom_line() +
  scale_y_continuous("Total Slaughtered Bulls, Bullocks and Steers", labels = scales::label_comma()
  ) +
  theme_minimal()

# Plot Month and Count with geom smooth to visualize changes over time
ggplot(slaughter_bbs, aes(x = Month, y = Count)) +
  geom_smooth() +
  scale_y_continuous("Total Slaughtered Bulls, Bullocks and Steers", labels = scales::label_comma()
  ) +
  theme_minimal()

This data is quite concentrated as it spans several decades and includes monthly-level cuts. Using geom_smooth(), it is possible to see the trends that the data experiences at an aggregated level.

Victorian Electricity Demand from vic_elec

# Load Victorian Electricity Demand
vic_elec_demand <- vic_elec |> 
  select(Time, Demand)

# Plot Month and Count as a line graph
ggplot(vic_elec_demand, aes(x = Time, y = Demand)) +
  geom_line() +
  scale_y_continuous("Victorian Electricity Demand (MWh)", labels = scales::label_comma()
  ) +
  theme_minimal()

# Plot Month and Count as a line graph
ggplot(vic_elec_demand, aes(x = Time, y = Demand)) +
  geom_smooth() +
  scale_y_continuous("Victorian Electricity Demand (MWh)", labels = scales::label_comma()
  ) +
  theme_minimal()

Once again, this data is too concentrated to meaningfully parse trends. It uses a scale of 30 minutes and spans several years. Using geom_smooth(), it is possible to see the trends that the data experiences at an aggregated level.

Gas Production from aus_production

# Compute Box-Cox lambda using the Guerrero method
lambda <- aus_production |> 
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)

# Load Gas Production, extract quarter for graphing, and apply Box-Cox transformation
gas_prod <- aus_production |> 
  select(Quarter, Gas) |> 
  mutate(
    Quarter_Num = as.integer(str_sub(Quarter, 7, 7)), 
    Transformed_Gas = box_cox(Gas, lambda)  # Use stored lambda
  )

# Plot Gas Production as a transformed line graph by Quarter and Gas
ggplot(gas_prod, aes(x = Quarter, y = Transformed_Gas)) +
  geom_line() + 
  geom_point(aes(color = factor(Quarter_Num))) +  # No need to specify y again
  scale_color_brewer(palette = "Set2") +
  labs(title = paste0("Gas Production & Box-Cox Transformed Gas (lambda = ", round(lambda, 2), ")"),
       x = "Year and Quarter",
       y = "Transformed Gas Production",
       color = "Quarter") +
  theme_minimal()

This data includes quarterly observations. To see that seasonality, I added colored points to the line graph based on the quarter. Now it is easy to see that the increased variance within each year favors higher values for Q3 and Q2 and lower values for Q1 and Q4. As far as transformation goes, the box_cox transformation is appropriate for this data. This transformation was chosen as it “makes the size of the seasonal variation about the same across the whole series, as that makes the forecasting model simpler” (Hyndman).

Question 3

Prompt: Why is a Box-Cox transformation unhelpful for the canadian_gas data?

# Plot Month and Volume as a line graph to visualize data
ggplot(canadian_gas, aes(x = Month, y = Volume)) +
  geom_smooth() +
  ylab("Gas Volume (billions of cubic metres)") +
  xlab("Year and Month") + 
  theme_minimal()

The data above is quite smooth. This means the Box-Cox transformation would not regulate any variance, as the variance is already largely steady.

Question 4

Prompt: What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(64)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

head(myseries)

## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State    Industry                         `Series ID`    Month Turnover
##   <chr>    <chr>                            <chr>          <mth>    <dbl>
## 1 Victoria Other specialised food retailing A3349799R   1982 Apr     34.9
## 2 Victoria Other specialised food retailing A3349799R   1982 May     34.6
## 3 Victoria Other specialised food retailing A3349799R   1982 Jun     34.6
## 4 Victoria Other specialised food retailing A3349799R   1982 Jul     35.2
## 5 Victoria Other specialised food retailing A3349799R   1982 Aug     33.8
## 6 Victoria Other specialised food retailing A3349799R   1982 Sep     35.4

# Compute Guerrero lambda for the selected series (Turnover)
myseries_lambda <- myseries |> 
  features(Turnover, features = guerrero) |> 
  pull(lambda_guerrero)

# Apply the Box-Cox transformation using the computed lambda
myseries <- myseries |> 
  mutate(Turnover_Transformed = box_cox(Turnover, myseries_lambda))

# Plot the transformed turnover over time
ggplot(myseries, aes(x = Month, y = Turnover_Transformed)) + 
  geom_line() +
  labs(title = paste0("Box-Cox Transformed Retail Turnover (lambda = ", round(myseries_lambda, 2), ")"),
       x = "Year and Month",
       y = "Transformed Turnover") +
  theme_minimal()

Based on the guerrera calculation, -0.35 would be the ideal lambda to stabilize the variance of the retail data. Looking at the graph, I agree that this lambda does cause a relatively standard variance across time.

Question 5

Prompt: For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

Tobacco from aus_production
Economy class passengers between Melbourne and Sydney from ansett
Pedestrian counts at Southern Cross Station from pedestrian

Each of these transformations will follow the same logic flow:

Guerrero lambda must be found
Each series will be stabilized using that lambda value
Then the series will be plotted using ggplot

Tobacco from aus_production

# Compute Guerrero lambda for Tobacco series
lambda_tobacco <- aus_production |>
  select(Quarter, Tobacco) |>
  features(Tobacco, features = guerrero) |>
  pull(lambda_guerrero)

# Apply Box-Cox transformation to Tobacco column
tobacco_transformed <- aus_production |> 
  mutate(Tobacco_Transformed = box_cox(Tobacco, lambda_tobacco))

# Plot the transformed series
ggplot(tobacco_transformed, aes(x = Quarter, y = Tobacco_Transformed)) + 
  geom_line() +
  labs(title = paste("Box-Cox Transformed Tobacco Series (lambda =", round(lambda_tobacco, 2), ")"),
       x = "Quarter", y = "Transformed Tobacco") +
  theme_minimal()

Economy class passengers between Melbourne and Sydney from ansett

# Filter data for Economy class passengers between Melbourne and Sydney
econ_mel_syd <- ansett |>
  filter(Airports %in% c("MEL-SYD", "SYD-MEL")) |>
  filter(Class == "Economy")

# Compute Guerrero lambda for Economy Medbourne Sydney data
lambda_econ_ms <- econ_mel_syd |>
  select(Week, Passengers) |>
  features(Passengers, features = guerrero) |>
  pull(lambda_guerrero)

# Apply Box-Cox transformation to Passengers column
econ_ms_transformed <- econ_mel_syd |> 
  mutate(Econ_MS_Transformed = box_cox(Passengers, lambda_econ_ms))

# Plot the transformed series
ggplot(econ_ms_transformed, aes(x = Week, y = Econ_MS_Transformed)) + 
  geom_line() +
  labs(title = paste("Box-Cox Transformed Passenger Series (lambda =", round(lambda_econ_ms, 2), ")"),
       x = "Week", y = "Transformed Passengers") +
  theme_minimal()

Pedestrian counts at Southern Cross Station from pedestrian

pedestrian

## # A tsibble: 66,037 x 5 [1h] <Australia/Melbourne>
## # Key:       Sensor [4]
##    Sensor         Date_Time           Date        Time Count
##    <chr>          <dttm>              <date>     <int> <int>
##  1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630
##  2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826
##  3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567
##  4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264
##  5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139
##  6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77
##  7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44
##  8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56
##  9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113
## 10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166
## # ℹ 66,027 more rows

# Filter data for pedestrian counts at Southern Cross Station
scs_peds <- pedestrian |>
  filter(Sensor == "Southern Cross Station") |>
  select(Date, Count)

# Compute Guerrero lambda for pedestrian counts at Southern Cross Station data
lambda_scs_peds <- scs_peds |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)

# Apply Box-Cox transformation to Passengers column
scs_peds_transformed <- scs_peds |> 
  mutate(SCS_Peds_Transformed = box_cox(Count, lambda_scs_peds))

# Plot the transformed series
ggplot(scs_peds_transformed, aes(x = Date, y = SCS_Peds_Transformed)) + 
  geom_line() +
  labs(title = paste("Box-Cox Transformed Pedestrian Count Series (lambda =", round(lambda_scs_peds, 2), ")"),
       x = "Date", y = "Transformed Passengers") +
  theme_minimal()

Wow, what a graph. In this case, the data needs more work to become insightful. To fix the readability of this graph, I would take a smoothing algorithm like geom_smooth(). This would allow for trends to be read. Also, the seasonality of the data can give insights, so gg_season() would come in handy here.

Question 7

Prompt: Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)
head(gas)

## # A tsibble: 6 x 2 [1Q]
##     Gas Quarter
##   <dbl>   <qtr>
## 1   221 2005 Q3
## 2   180 2005 Q4
## 3   171 2006 Q1
## 4   224 2006 Q2
## 5   233 2006 Q3
## 6   192 2006 Q4

a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

b. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

c. Do the results support the graphical interpretation from part a?

d. Compute and plot the seasonally adjusted data.

e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

a. Time Series Plot and Seasonal Fluctuations/ Trend-Cycle

# Extract quarter number for graphing
gas <- gas |> mutate(Quarter_Num = as.integer(str_sub(Quarter, 7, 7)))

# Plot with colored points by quarter
ggplot(gas, aes(x = Quarter, y = Gas)) +
  geom_line() +
  geom_point(aes(color = factor(Quarter_Num)), size = 3) + 
  scale_color_brewer(palette = "Set2") +  
  labs(title = "Gas Production Over Last 5 Years",
       x = "Year and Quarter",
       y = "Gas Production",
       color = "Quarter") +
  theme_minimal()

The plot fluctuates by quarter, where Q1 < Q2 < Q3 > Q4. It seems April to September are the months containing the highest gas production based on their quarter averages. The overall trend seems to be increasing over time, with many quarters greater than the previous observation of that quarter.

b. Classical Decomposition

# Perform classical decomposition with multiplicative type
gas_decomposed <- gas |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components()

# Plot decomposition results with autoplot
autoplot(gas_decomposed) +
  labs(title = "Classical Decomposition of Gas Production (Multiplicative)")

Using this time series decomposition, it is easy to see the trend is increasing over time, the seasonality is consistent over quarters and completes a loop after 4 quarters (yearly). The remainder graph shows the variation in the remainder is much lower than the other components.

c. Results of b compared to a

My analysis in question a still stands - gentle upward trend with mostly consistent quarterly data as seasons. After looking at the decomposition graphs in b, it is easier to see the actual significance of the upward trend, the seasonality by quarter, and how much variability affects these.

d. Plotting Seasonally Adjusted Data

# Calculate multiplicative seasonal adjustment
gas_decomposed <- gas_decomposed |>
  mutate(Seasonally_Adjusted = Gas / seasonal)

# Plot this seasonally adjusted data
autoplot(gas_decomposed, Seasonally_Adjusted) +
  labs(title = "Seasonally Adjusted Gas Production",
       x = "Year and Quarter",
       y = "Seasonally Adjusted Gas Production") +
  theme_minimal()

After seasonally adjusting the data, the upward trend and its randomness become easier to see. There are some peaks and dips in the time series above, but the trend is still overall upward.

e. Outlier and Effect

# Increase the sixth observation by 300
gas_outlier <- gas
gas_outlier$Gas[6] <- gas_outlier$Gas[6] + 300

# Recompute multiplicative decomposition
gas_outlier_decomposed <- gas_outlier |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components() |>
  mutate(Seasonally_Adjusted = Gas / seasonal)

# Plot seasonally adjusted data with outlier
autoplot(gas_outlier_decomposed, Seasonally_Adjusted) +
  labs(title = "Seasonally Adjusted Gas Production (With Outlier)",
       x = "Year and Quarter",
       y = "Seasonally Adjusted Gas Production") +
  theme_minimal()

Oh my, it seems the outlier has really upset the seasonality adjustment’s ability to determine the quarterly consistancies. This result likely comes from the fact that the sixth observation was changed, which corresponds to Q4, one of the lower seasonal quarters. The seasonaly adjusted graph includes the Q3 and Q4 dips that would otherwise be seasonally adjusted for.

f. Outlier Location Effect

# Outlier near the end
gas_outlier_latter <- gas
gas_outlier_latter$Gas[18] <- gas_outlier_latter$Gas[18] + 300 

# Recompute decomposition
gas_outlier_latter_decomposed <- gas_outlier_latter |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components() |>
  mutate(Seasonally_Adjusted = Gas / seasonal)

# Plot seasonally adjusted data with outlier near the end
autoplot(gas_outlier_latter_decomposed, Seasonally_Adjusted) +
  labs(title = "Seasonally Adjusted Gas Production for Latter Outlier",
       x = "Year and Quarter",
       y = "Seasonally Adjusted Gas Production") +
  theme_minimal()

There does not seem to be much of a difference between the near-start outlier and the near-end outlier as far as seasonality adjustment ability is affected. If anything, the latter outlier graph has slightly decreased variability in the seasonal trend.

Question 8

Prompt: Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

# Mirroring question 4's logic, get retail time series data from Exercise 7
set.seed(64)
myseries_x11 <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

# X-11 Decomposition to myseries_x11
x11_dcmp <- myseries_x11 |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components()

# Autoplot the X-11 decomposition
autoplot(x11_dcmp) +
  labs(title =
    "Decomposition of Australian retail employment using X-11.")

The data does have an upward trend, but it is easier to see the different bumps in that trend with this X-11 trend graph. Seasonality shows that there is a large increase in a single season with the other seasons’ variabilities not as extreme. For the irregularity graph, the most variability seems to happen around 1994, 2000-2002, and 2009-2015. These jumps likely correlate with the Australian economic cycles at these times.

Question 9

Prompt: Figures 3.19 and 3.20 (see textbook) show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

a. Figure 3.20 is remarkable helpful when analysing the trends of seasonality across years. Through this breakdown, we can see the seasons that go against the upward trend at the second half of the graph - March, August and November all end below their averages - and the months that have the most variability overall - March, July, and August are the only months having greater than 25 units of variance.

b. Is the recession of 1991/1992 visible in the estimated components?

b. The Figure 3.19 is much more helpful in visualizing the recession of 91/92. The remainder graph here shows the major dip where the recession happened, in the otherwise relatively smooth variability line graph. Also, the value line graph that shows the full data does show a dip in its upward trend during this recession.