3.1 Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

gdp_pc <- global_economy %>%
  mutate(GDP_per_capita = GDP / Population)

gdp_pc

gdp_pc %>%
  ggplot(aes(x = Year, y = GDP_per_capita, group = Country, colour = Country)) +
  geom_line(show.legend = FALSE) +
  labs(title = "GDP per capita for each country over time",
       y = "GDP per capita (USD)",
       x = "Year")

## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).

gdp_pc %>%
  filter(Year == max(Year)) %>%
  arrange(desc(GDP_per_capita)) %>%
  slice(1)

Luxemborg has the highest GDP per capita.
They grew significantly between 1970 and 1980, grabbing the top spot around the 1980s and maintaining it throughout the decades.
Their GDP per capita peaked around 2015.

3.2 For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.

us_gdp <- global_economy |>
  filter(Country == "United States") |>
  mutate(GDP_per_capita = GDP / Population)

# Original GDP
us_gdp |>
  autoplot(GDP) +
  labs(title = "United States GDP (Original)", y = "US$ (Billions)")

# GDP per capita (population adjustment)
us_gdp |>
  autoplot(GDP_per_capita) +
  labs(title = "United States GDP per Capita", y = "US$ per person")

# GDP inflation adjustment
us_gdp <- global_economy |>
  filter(Country == "United States") |>
  mutate(GDP_real = GDP / (CPI / 100))  # CPI base = 100

autoplot(us_gdp, GDP_real) +
  labs(title = "US Real GDP (Inflation adjusted)", y = "US$ (Billions, constant prices)")

us_gdp <- us_gdp |>
  mutate(GDP_per_capita_real = GDP_real / Population)

autoplot(us_gdp, GDP_per_capita_real) +
  labs(title = "US Real GDP per Capita", y = "US$ (constant prices)")

Original GDP: Shows strong, clear growth with a slight dip around 2008.
GDP per capita: Adjusting for population gives a clearer picture of living standards. The series still grows strongly, but the growth is a bit smoother and not as steep as total GDP.
Real GDP: Adjusting for inflation flattens the curve and shows the real GDP over time - this grows steadily but at a slower rate than the Original GDP plot.
Real GDP per capita: Adjusting for both inflation and population shows steady growth, but with clear dips during times of financial crisis (1970s, early 1990s, 2008).

Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

bulls <- aus_livestock |>
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria")

# Original
bulls |>
  autoplot(Count) +
  labs(title = "Victorian Slaughter of Bulls, Bullocks & Steers (Original)", y = "Count")

# Log transform
bulls |>
  autoplot(log(Count)) +
  labs(title = "Victorian Slaughter of Bulls, Bullocks & Steers (Log scale)", y = "log(Count)")

The original data shows large fluctuations in the early years and smaller ones later.
After applying the log transformation, the magnitude of the fluctuations becomes stabilized and appears more consistent across the entire period.
The seasonal pattern and overall decline is still visible, but the variance is less extreme.

Victorian Electricity Demand from vic_elec.

# Original electricity demand
vic_elec |>
  autoplot(Demand) +
  labs(title = "Victorian Electricity Demand (Original)", y = "MW")

# Calendar adjustment: average demand by day of week
vic_elec |>
  index_by(Day = date(Time)) |>
  summarise(Daily_Demand = sum(Demand)) |>
  mutate(Weekday = wday(Day, label = TRUE)) |>
  ggplot(aes(x = Day, y = Daily_Demand, colour = Weekday)) +
  geom_line() +
  labs(title = "Victorian Electricity Demand (Calendar Adjustment)", 
       y = "MW", x = "Day")

The raw data is very noisy, which makes it hard to see meaningful patterns - only highlighting broad peaks and dips
After applying a calendar adjustment, the underlying weekly pattern is clear.
Demand is consistently higher on weekdays (especially Tuesday–Thursday) and lower on weekends, with Sundays showing the lowest demand.
This adjustment highlights how electricity usage is strongly tied to the calendar cycle.

Gas production from aus_production.

gas <- aus_production |>
  select(Gas)

# Original
gas |>
  autoplot(Gas) +
  labs(title = "Australian Gas Production (Original)", y = "Petajoules")

# Log transform
gas |>
  autoplot(log(Gas)) +
  labs(title = "Australian Gas Production (Log scale)", y = "log(Petajoules)")

The original plot shows a strong upward trend with seasonal fluctuations that increase in size over time.
After applying the log transformation, the seasonal fluctuations appear more stable across time.
The seasonal pattern and overall decline is still visible, but the variance is less extreme.

3.3 Why is a Box-Cox transformation unhelpful for the canadian_gas data?

library(fpp3)
canadian_gas

canadian_gas %>%
  autoplot(Volume) +
  labs(title = "Canadian Gas Production", y = "Monthly Volume")

canadian_gas %>%
  features(Volume, features = guerrero)

lambda <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

canadian_gas |>
  autoplot(box_cox(Volume, lambda)) +
  labs(title = paste("Canadian Gas Production (Box-Cox, lambda =", round(lambda,3), ")"))

A Box–Cox transformation is unhelpful for the Canadian gas production series because the variance is already stable with seasonal fluctuations remaining about the same size across time. lambda=0.58 suggests a mild power transform, but after I applied it to the data it produced an almost identical plot - showing that there is no additional benefits of using box-cox here.

3.4 What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

set.seed(2705)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

# Estimate the Box–Cox lambda using Guerrero
lam <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

lam

## [1] 0.4807535

Since lambda is between 0 and 1, a mild Box–Cox power transform with lambda = 0.4807535 will stabilize the variance - if even only slightly. Since it’s not exactly 0, log transform may not be the best.

myseries_boxcox <- myseries |>
  mutate(Turnover_boxcox = box_cox(Turnover, lambda = 0.4807535))

myseries_boxcox |>
  pivot_longer(cols = c(Turnover, Turnover_boxcox), 
               names_to = "Series", values_to = "Value") |>
  ggplot(aes(x = Month, y = Value)) +
  geom_line() +
  facet_wrap(~ Series, scales = "free_y", ncol = 1) +
  labs(title = "Retail Turnover: Original vs Box-Cox Transformed (lambda = 0.481)",
       y = "Turnover", x = "Month")

With this transformation, the overall trend remains - but it is a bit smoother (less changing variance).

3.5 For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

Tobacco from aus_production

# Tobacco production
tobacco <- aus_production %>% select(Tobacco)

# lambda
tobacco %>%
  features(Tobacco, features = guerrero)

# transformation
tobacco_boxcox <- tobacco |>
  mutate(Tobacco_boxcox = box_cox(Tobacco, lambda = 0.926))

# Original vs Box-Cox
tobacco_boxcox |>
  pivot_longer(cols = c(Tobacco, Tobacco_boxcox),
               names_to = "Series", values_to = "Value") |>
  ggplot(aes(x = Quarter, y = Value)) +
  geom_line() +
  facet_wrap(~ Series, scales = "free_y", ncol = 1) +
  labs(title = "Tobacco Production: Original vs Box-Cox Transformed (lambda = 0.926)",
       y = "Production")

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

For Tobacco from aus_production, we got lambda=0.93, which is very close to 1. This shows that the variance is already stable. The Box–Cox transformation only changed the scale slightly, so the original plot was already appropriate without major adjustment.

Economy class passengers between Melbourne and Sydney from ansett

ansett %>%
  filter(Airports == "MEL-SYD", Class == "Economy") %>%
  features(Passengers, features = guerrero)

# transformation
ansett_boxcox <- ansett |>
  mutate(Passengers_boxcox = box_cox(Passengers, lambda = 2))

# Original vs Box-Cox
ansett_boxcox |>
  pivot_longer(cols = c(Passengers, Passengers_boxcox),
               names_to = "Series", values_to = "Value") |>
  ggplot(aes(x = Week, y = Value)) +
  geom_line() +
  facet_wrap(~ Series, scales = "free_y", ncol = 1) +
  labs(title = "Economy Class MEL-SYD: Original vs Box-Cox Transformed (lambda = 2)",
       y = "Passengers")

For the Melbourne–Sydney economy passengers, we have lambda = 2. This suggests that transforming this data will only exaggerate the differences (by squaring the data) rather than stabilizing them. We should just use the original plot without transformation in this case.

Pedestrian counts at Southern Cross Station from pedestrian.

pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  features(Count, features = guerrero)

# lambda = -0.25 -> lambda = 0
ped_sc_boxcox <- pedestrian |>
  mutate(Count_log = log(Count))

# Original vs Box-Cox
ped_sc_boxcox |>
  pivot_longer(cols = c(Count, Count_log),
               names_to = "Series", values_to = "Value") |>
  ggplot(aes(x = Date_Time, y = Value)) +
  geom_line() +
  facet_wrap(~ Series, scales = "free_y", ncol = 1) +
  labs(title = "Southern Cross Pedestrians: Original vs Box-Cox Transformed (lambda = 0)",
       y = "Pedestrian Counts")

For the Southern Cross pedestrians, lambda = -0.25. There is unstable variance with extreme peaks, and a strong transformation is needed to stabilize it. Since lambda is close to 0, a power transformation less than 0 helps stabilize variance - so a log-like transformation is appropriate here.

3.7 Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)
gas

3.7a Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

gas |> autoplot(Gas) +
  labs(title = "Australian Gas Production (last 5 years)",
       y = "Petajoules")

In this plot, I can see very clear seasonality (quarterly cycles) and a very slight upward trend.

3.7b Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas_dc <- gas |> 
  model(classical_decomposition(Gas, type = "multiplicative")) |> 
  components()

gas_dc

autoplot(gas_dc)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Trend-cycle (trend): slowly increasing, from ~200 in 2006 Q1 to ~226 by 2009 Q4.
Seasonal indices (seasonal): These are repeated every year.
- Q1 and Q4 consistently have lower values (~0.88–0.93).
- Q2 and Q3 consistently have higher values (~1.07–1.13).

3.7c Do the results support the graphical interpretation from part a?

In 3.7a, I mentioned clear seasonality (quarterly cycles) and a very slight upward trend. So yes, the trend component from part b’s decomposition confirms this slight gradual increase. The seasonal component also confirms the clear quarterly seasonality, with higher values around Q2–Q3 and lower values in Q1–Q4, matching the peaks and dips in part a’s plot.

Since the remainder is small, we can conclude that most of the variation was cause by trend and seasonality.

3.7d Compute and plot the seasonally adjusted data.

# Multiplicative classical decomposition
gas_dc <- gas |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components()

# original vs seasonally adjusted
gas_dc |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Gas), colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "blue") +
  labs(title = "Australian Gas Production (Seasonally Adjusted)",
       y = "Petajoules", x = "Quarter")

The grey line shows the original series with strong seasonal ups and downs each year.
The blue line is the seasonally adjusted series — much smoother because the regular seasonal fluctuations are removed.
The upward trend and steep drop towards the end is clearer to see once the seasonal variation is removed.

3.7e Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

gas_outlier <- gas
gas_outlier$Gas[10] <- gas_outlier$Gas[10] + 300  # adding 300 to one observation

gas_dc_outlier <- gas_outlier |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components()

gas_dc_outlier |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Gas), colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "red") +
  labs(title = "Australian Gas Production (With Outlier, Seasonally Adjusted)",
       y = "Petajoules", x = "Quarter")

The grey line (original) has a huge spike right before 2008 Q1 because of the fake outlier.
The red line (seasonally adjusted) also spikes at the same point.
- Since the spike is not seasonal, it is considered part of the trend and remainder components, distorting the entire seasonally adjusted series and making it seem as though there was a sharp spike and drop that is entirely false.
- Outliers can cause the data to be very misleading if it remains uncorrected.

3.7f Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas_mid <- gas
gas_end <- gas

# middle outlier
gas_mid$Gas[10] <- gas_mid$Gas[10] + 300

# end outlier
gas_end$Gas[nrow(gas_end)] <- gas_end$Gas[nrow(gas_end)] + 300

gas_dc_mid <- gas_mid |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components()

gas_dc_end <- gas_end |>
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components()

# outlier in the middle
p_mid <- gas_dc_mid |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Gas), colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "red") +
  labs(title = "Australian Gas Production (Outlier in Middle, Seasonally Adjusted)",
       y = "Petajoules", x = "Quarter")

# outlier at the end
p_end <- gas_dc_end |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Gas), colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "blue") +
  labs(title = "Australian Gas Production (Outlier at End, Seasonally Adjusted)",
       y = "Petajoules", x = "Quarter")

p_mid

p_end

Yes, there is a difference if the outlier is near the end rather than in the middle of the time-series.
- When there is an outlier in the middle (red line), it distorts both the seasonality and trend around that point but can eventually be smoothed out over time by the surrounding data.
- When there is an outlier at the end of the series, it only affects the last few points but the earlier seasonality and trend remain basically unaffected.
  – However, because this outlier is at the end, there is no following data to assist in smoothing out/ balancing out this data and can much more misleading especially for those who are interested in the latest point in a series.

3.8 Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

set.seed(2705) 
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components() %>%
  autoplot()

There was a steady upward trend in retail turnover over time showing significant growth around the early 2000s.
With regards to the seasonality, the repeating monthly patterns are very consistent across the entire period but the magnitude/impact does seem to lessen overtime.
The irregular plot does highlight some very sharp spikes (especially around the mid to late 1980s), which may be due to outliers or unusual events that occurred.
- These spikes stood out to me in this decomposition - because they are not very visible in any of the other graphs.

3.9 Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

Is the recession of 1991/1992 visible in the estimated components?

There is a steady upward trend in the labor force from the late 1970s to the mid-1990s. If examined closely we can also see a very slight dip around 1991/1992.
The seasonal component shows recurring annual patterns: higher values in March and December, and lower values around January and August The scale (±100 people) is small when comparing it to the overall level of 7,000–9,000, so although seasonality is present it is quite modest.
In the remainder we can see some irregular fluctuations, including a large dip around 1991/1992 (likely connected to that recession). This shows that the recession did in fact have a clear effect beyond the regular seasonal and trend patterns.
Overall, the decomposition confirms that both trend and remainder show the impact of the 1991/1992 recession with a slight dip in the trend and sharp dips in the remainder around those years.

Homework2

Aaliyah John-Harry

2025-09-09