Assignment 2: Time Series Decomposition - DATA 624

Author

Amanda Rose Knudsen

Published

February 15, 2025

Assignment 2: Do exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from the online Hyndman book. Link to Chapter 3 exercises for reference.

Loading the Data

library(tidyverse)
library(fpp3)
library(seasonal)
library(ggrepel)

3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

First, we will divide the GDP by population to obtain the per-capita GDP.

per_capita_gdp <- global_economy |> 
  mutate(gdp_per_capita = GDP / Population) |> 
  select(Country, Year, gdp_per_capita) |> 
  filter(!is.na(gdp_per_capita))

We plot the below to see the general shape of each country’s GDP per capita over time. After some trial and error, we decide to remove labels from this plot because it is way too much to see all 250+ countries at the same time, when all we’re trying to do is look at the ‘top’.

ggplot(per_capita_gdp, aes(x = Year, y = gdp_per_capita, group = Country)) +
  geom_line(color = "navy", alpha = 0.3) + 
  labs(title = "GDP per Capita for each Country Over Time",
       x = "Year",
       y = "GDP per Capita") +
  theme_minimal() +
  theme(legend.position = "none")

We’ll find out which 5 countries have the overall highest top GDP per capita so we can then filter to ‘highlight’ those and make the time series plot more visually approachable.

# First we pull the top 5 countries with highest max GDP per capita across years
top_5_countries <- per_capita_gdp |> 
  as_tibble() |>
  group_by(Country) |> 
  summarize(max_gdp_per_capita = max(
    gdp_per_capita, na.rm = TRUE), .groups = "drop") |> 
  arrange(desc(max_gdp_per_capita)) |> 
  slice_head(n = 5) |> 
  pull(Country)
top_5_countries

[1] Monaco           Liechtenstein    Luxembourg       Norway          
[5] Macao SAR, China
263 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe

# Then we filter data for those top 5 countries so we can just highlight those
top_5_data <- per_capita_gdp |> 
  filter(Country %in% top_5_countries)

# Then we plot GDP per capita for all countries, highlighting the top 5
ggplot(per_capita_gdp, aes(x = Year, y = gdp_per_capita, group = Country)) +
  geom_line(color = "lightgray", alpha = 0.5) +
  geom_line(data = top_5_data, aes(color = Country), size = 1) + 
  labs(title = "GDP per Capita Across All Countries (Highlight Top 5)",
       x = "Year",
       y = "GDP per Capita") +
  theme_minimal()

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Monaco has the top GDP per capita in the latest year of the dataset and overall, however we see that there has been a bit of back and forth between Monaco and Liechtenstein for the highest GDP per capita very recently. However, it’s consistently Monaco at the top, followed by Liechtenstein followed by Luxembourg at the very top of the GDP per capita for the past 30 years – with Monaco topping Liechtenstein for almost every year in the past 30 years, with only a few exception years when Liechtenstein is visually at or above Monaco for GDP per capita.

3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from `global_economy`.

global_economy |> 
  filter(Country == "United States") |> 
  ggplot(aes(x = Year, y = GDP)) +
  geom_line(na.rm = TRUE) + 
  labs(title = "GDP for the United States", y = "GDP") +
  theme_minimal()

We’ll transform this to GDP per capita – a population adjustment.

per_capita_gdp |> 
  filter(Country == "United States") |> 
  autoplot(gdp_per_capita) +
  labs(title = "GDP per Capita for the United States",
       x = "Year",
       y = "GDP per Capita") +
  theme_minimal()

We did this transformation because it makes more sense to look at GDP per capita rather than GDP overall. That said, the two graphs look extremely similar in shape. We can tell that the GDP and GDP per Capita is increasing over time at a relatively steady rate, excluding the Great Recession decline visible in both plots.

Slaughter of Victorian “Bulls, bullocks and steers” in `aus_livestock`.

aus_livestock |> 
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria") |> 
  autoplot()

Plot variable not specified, automatically selected `.vars = Count`

Based on the above autoplot, let’s try a log transformation - it may be helpful to stabilize the variance and make the patterns more feasible to interpret.

aus_livestock |> 
  filter(Animal == "Bulls, bullocks and steers", State == "Victoria") |> 
  mutate(log_Count = log(Count)) |> 
  autoplot(log_Count) +
  labs(title = "Log-Transformed Series",
       y = "Log(Count)")

This definitely helps stabilize the variance over time. However, it’s still unclear that the patterns are more easy to interpret. We’d likely want to do more to truly understand what’s going on here, like a STL decomposition, but we’ll save that for later in the assignment.

Victorian Electricity Demand from `vic_elec`.

autoplot(vic_elec, Demand)

Here it seems like a box cox transformation might be useful. The fluctuations are larger when Demand is higher. Although we do also see clear seasonal patterns, but the Demand spikes appear to be increasing over time.

lambda_vic_elec_demand <- vic_elec |> 
  features(Demand, features = guerrero) |> 
  pull(lambda_guerrero)

print(lambda_vic_elec_demand)

[1] 0.09993089

vic_elec |> 
  mutate(boxcox_Demand = box_cox(Demand, lambda_vic_elec_demand)) |> 
  autoplot(boxcox_Demand) +
  labs(title = latex2exp::TeX(paste0(
         "Transformed Victorian Electrcity Demand with $\\lambda$ = ",
         round(lambda_vic_elec_demand,4))))

This successfully helps somewhat stabilize the variance we saw that the spikes increased as the time went on. However, it’s not as useful as it appears to be for the following series for Australian gas production. As for all the transformations, it would be best to go forward with decomposition to determine the full utility of each type of transformation and compare to others as needed.

Gas production from `aus_production`.

autoplot(aus_production, Gas)

We know that the box-cox transformation applied to Australian quarterly gas production will be a useful transformation. We’ll again use the guerrero feature to choose a value of lambda for us.

lambda_aus_production <- aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Gas, lambda_aus_production)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed gas production with $\\lambda$ = ",
         round(lambda_aus_production,4))))

This is clearly a valuable transformation and helps stabilize the overall trend and variance over time.

3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

autoplot(canadian_gas, Volume)

A box-cox transformation is unhelpful for the canadian_gas data for a few reasons. The trend was already clearly linear, and that doesn’t change when we apply the below box-cox transformation. In addition, the variance doesn’t change drastically over time in the above. While the seasonal fluctuations do at times appear to grow as the trend increases, this is not a stable happening, and it’s not significant enough to warrant the utility of the box-cox transformation. We can see in the below that not terribly much changes when we apply the box-cox transformation using the guerrero feature.

lambda_canadian_gas <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)
canadian_gas |>
  autoplot(box_cox(Volume, lambda_canadian_gas)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Canadian gas production with $\\lambda$ = ",
         round(lambda_canadian_gas,4))))

3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

We didn’t previously do exercise 7 in Section 2.10, so let’s do that first. The prompt was: Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value) and explore the retail time series – note if you can spot seasonality, cyclicity and trend:

set.seed(5889)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

Note: I changed the seed above.

autoplot(myseries)

Plot variable not specified, automatically selected `.vars = Turnover`

myseries |> 
  gg_subseries()

Plot variable not specified, automatically selected `y = Turnover`

myseries |> 
  gg_season()

Plot variable not specified, automatically selected `y = Turnover`

We can clearly see an increase over time (trend) as well as seasonality visible in recurring increases (peaks) each December and declines (troughs) each January/February. We can also see that the variance increases over time making this a good candidate for box-cox transformation. Additionally, seasonality appears to grow as the values increase, another indicator that box-cox transformation is likely useful here. So, let’s do that - and find the best lambda using the guerrero feature as we did previously.

lambda_aus_retail <- myseries |> 
  features(Turnover, features = guerrero) |> 
  pull(lambda_guerrero)

print(lambda_aus_retail)

[1] 0.08542275

The recommended lambda for the transformation is 0.085. Let’s check it out.

myseries |>
  autoplot(box_cox(Turnover, lambda_aus_retail)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Australian Retail Turnover with $\\lambda$ = ",
         round(lambda_aus_retail,4))))

This definitely helps stabilize the variance and seasonality, compared with the time series we initially observed using autoplot() above.

3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

Once again we will use the recommended lambda using the guerrero feature for each of the following series. This enables us to optimize stabilization of the variance.

Tobacco from `aus_production`

#  Tobacco from `aus_production` not transformed:
aus_production |> 
  autoplot(Tobacco)

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).

#  Tobacco from `aus_production` transformed:
lambda_tobacco_aus_production <- aus_production |>
  features(Tobacco, features = guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Tobacco, lambda_tobacco_aus_production)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Australian Tobacco production with $\\lambda$ = ",
         round(lambda_tobacco_aus_production,4))))

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_line()`).

The lambda for this box-cox transformation using the guerrero feature was 0.9265. Since the lambda is so close to 1, it’s unclear that a transformation is actually useful or needed here. Additionally, we can see that the shape of the time series does not change very much when using this box-cox transformation. So, perhaps we should’ve first calculated the guerrero lambda to determine it’s utility prior to going ahead and plotting it. Either way, it helped us gain some valuable practice!

Economy class passengers between Melbourne and Sydney from `ansett`

After running View(ansett) to view the dataset and ?ansett to understand more about what we’re looking at, we know that we need to identify the airport code for the Airports column in the ansett tsibble unique to ‘between Melbourne and Sydney’. We’ll also need to filter on Class to ‘Economy’ passengers only. Let’s do that first.

economy_MEL_SYD <- ansett |> 
  filter(Airports == "MEL-SYD", Class == "Economy")

#Economy class passengers between Melbourne and Sydney: Not transformed
autoplot(economy_MEL_SYD, Passengers)

#Economy class passengers between Melbourne and Sydney: Transformed
lambda_economy_MEL_SYD <- economy_MEL_SYD |>
  features(Passengers, features = guerrero) |>
  pull(lambda_guerrero)
economy_MEL_SYD |>
  autoplot(box_cox(Passengers, lambda_economy_MEL_SYD)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Economy class passengers MEL-SYD with $\\lambda$ = ",
         round(lambda_economy_MEL_SYD,4))))

print(lambda_economy_MEL_SYD)

[1] 1.999927

Using the Guerrero method to optimize the lambda (using the guerrero feature) the estimated box-cox lambda was 1.9, which seems unusual to be greater than 1 at all. It’s actually not clear that a box-cox transformation is most appropriate here, while it does adjust for the period of time when we know there was a pilot’s strike. Instead, a different adjustment would likely be most appropriate- perhaps one that eliminates the pilot’s strike that we learned about in previous chapters which is visible in the drop to zero over several weeks in 1989. Otherwise the box-cox transformation does more to attend to that unusual period of time rather than attend to the variance in the non-pilot-strike periods, which account for the vast majority of the data.

Pedestrian counts at Southern Cross Station from `pedestrian`

First we’ll filter down to just the Southern Cross station in pedestrian:

pedestrian_Southern_Cross <- pedestrian |> 
  filter(Sensor == "Southern Cross Station")

#Pedestrian counts at Southern Cross Station from `pedestrian`: Not transformed
autoplot(pedestrian_Southern_Cross, Count)

# Pedestrian counts at Southern Cross Station from `pedestrian`: Transformed
lambda_pedestrian_Southern_Cross <- pedestrian_Southern_Cross |>
  features(Count, features = guerrero) |>
  pull(lambda_guerrero)
pedestrian_Southern_Cross |>
  autoplot(box_cox(Count, lambda_pedestrian_Southern_Cross)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Canadian gas production with $\\lambda$ = ",
         round(lambda_pedestrian_Southern_Cross,4))))

Another interesting finding - a negative lambda! This lambda (-0.25) is odd and not really expected. It’s unclear that the transformation is useful or valuable in interpreting the data. As we know, there cannot be ‘negative’ pedestrians at any given station and it’s likely that these stations experience rush hour (peak) times and times when there are 0 pedestrians (at night, likely when stations are closed – I don’t think that many other transit systems or cities are as ‘all night’ / 24-7 as New York City). So it appears that the transformation is trying to account for these inherent features of the data, when perhaps they are not the right transformations at all. We’d have to look into this more to understand what’s truly going on, and whether a transformation is at all necessary or useful.

3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)

a. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

autoplot(gas, Gas)

Yes, it appears that there is a seasonal trend with peaks in the Q3 of each year, and an overall trend-cycle increase. Let’s decompose using STL.

gas_STL_model <- gas |> 
  model(stl = STL(Gas))

components(gas_STL_model)

# A dable: 20 x 7 [1Q]
# Key:     .model [1]
# :        Gas = trend + season_year + remainder
   .model Quarter   Gas trend season_year remainder season_adjust
   <chr>    <qtr> <dbl> <dbl>       <dbl>     <dbl>         <dbl>
 1 stl    2005 Q3   221  193.        26.9     0.856          194.
 2 stl    2005 Q4   180  197.       -16.7    -0.109          197.
 3 stl    2006 Q1   171  200.       -25.7    -3.59           197.
 4 stl    2006 Q2   224  204.        15.4     4.86           209.
 5 stl    2006 Q3   233  207.        27.0    -1.03           206.
 6 stl    2006 Q4   192  210.       -16.7    -1.33           209.
 7 stl    2007 Q1   187  213.       -25.5    -0.550          213.
 8 stl    2007 Q2   234  216.        15.1     2.60           219.
 9 stl    2007 Q3   245  219.        27.0    -0.730          218.
10 stl    2007 Q4   205  219.       -16.6     2.55           222.
11 stl    2008 Q1   194  219.       -25.3     0.562          219.
12 stl    2008 Q2   229  219.        14.8    -4.59           214.
13 stl    2008 Q3   249  219.        27.0     2.98           222.
14 stl    2008 Q4   203  220.       -16.6    -0.834          220.
15 stl    2009 Q1   196  222.       -25.0    -0.740          221.
16 stl    2009 Q2   238  223.        14.5     0.341          223.
17 stl    2009 Q3   252  225.        27.1    -0.132          225.
18 stl    2009 Q4   210  226.       -16.5     0.986          227.
19 stl    2010 Q1   205  226.       -24.8     4.25           230.
20 stl    2010 Q2   236  225.        14.2    -3.62           222.

components(gas_STL_model) |> autoplot()

There is most definitely a visible seasonal and a trend-cycle component to this time series.

b. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas |>
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) |>
  components() |>
  autoplot() +
  labs(title = "Classical multiplicative decomposition of gas production
                  from the last 5 years of aus_production")

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

c. Do the results support the graphical interpretation from part a?

Yes, the results from the classical multiplicative decomposition (as well as the STL decomposition I included) confirm the graphical interpretation from part a. Both classical multiplicative decomposition and STL decomposition identify a clear trend-cycle that shows an increase in gas production over time (over the 5 years included in the time series). The seasonal component also confirms that peaks occur in the third quarter of each year, as observed in the original plot using autoplot(). The results support my initial assessment.

d. Compute and plot the seasonally adjusted data.

Below we will plot the seasonal component

gas_mult_decomp <- gas |> 
  model(classical = classical_decomposition(Gas, type = "multiplicative"))  |> 
  components()

gas_mult_decomp  |> 
  autoplot(seasonal) +
  labs(title = "Seasonal Component of Gas Production 
       (Multiplicative Decomposition)",
       y = "Seasonal Effect",
       x = "Quarter")

Then we’ll compute and plot the seasonally adjusted data. We do this by removing the seasonal component from the original data.

gas_mult_decomp_seasonal_adjust <- gas_mult_decomp |> 
  mutate(seasonally_adjusted = Gas / seasonal)

gas_mult_decomp_seasonal_adjust |> 
  autoplot(seasonally_adjusted) +
  labs(title = "Seasonally Adjusted Gas Production",
       y = "Seasonally Adjusted Gas",
       x = "Quarter")

This is really interesting to see how significant of an impact the seasonality has on the gas production. It confirms that the seasonal patterns play a notable role in the original data. The seasonally-adjusted data provides us a view into the underlying trend-cycle without the influence of the seasonal fluctuations, which has the potential to contribute to a more informative long-term analysis due to the lack of seasonal impacts.

e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

First will create a copy of gas and introduce the manually-created outlier, and then recompute the seasonally adjusted data as requested in the question prompt.

gas_with_outlier <- gas

gas_with_outlier$Gas[4] <- gas_with_outlier$Gas[4] + 900

Now we perform the classical decomposition:

gas_with_outlier_decomp <- gas_with_outlier |> 
  model(classical = classical_decomposition(Gas, type = "multiplicative")) |> 
  components()

Now we will remove the seasonal component to see the impact of the outlier on the seasonally-adjusted data:

gas_with_outlier_decomp_adjusted <- gas_with_outlier_decomp |> 
  mutate(seasonally_adjusted = Gas / seasonal)

Finally, we plot:

gas_with_outlier_decomp_adjusted |> 
  autoplot(seasonally_adjusted) +
  labs(title = "Seasonally Adjusted Gas Production With Outlier",
       y = "Seasonally Adjusted Gas",
       x = "Quarter")

Wow! Introducing an outlier has an outsize impact on the seasonally adjusted data. This makes the trend component shift very noticeably. No longer are we seeing an increasing trend; we now see a spike followed by a relatively “steady” or flat, not increasing, overall trend. The remainder plus the trend is clearly very impacted by that one introduced outlier. This is very important to see the effect of outliers when using classical decomposition.

f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Well, let’s see - if we introduce the same amount increase for an outlier more toward the middle of the time series.

gas_with_outlier2 <- gas

gas_with_outlier2$Gas[11] <- gas_with_outlier2$Gas[11] + 900

Now we perform the classical decomposition:

gas_with_outlier_decomp2 <- gas_with_outlier2 |> 
  model(classical = classical_decomposition(Gas, type = "multiplicative")) |> 
  components()

Now we will remove the seasonal component to see the impact of the outlier on the seasonally-adjusted data:

gas_with_outlier_decomp_adjusted2 <- gas_with_outlier_decomp2 |> 
  mutate(seasonally_adjusted = Gas / seasonal)

Finally, we plot:

gas_with_outlier_decomp_adjusted2 |> 
  autoplot(seasonally_adjusted) +
  labs(title = "Seasonally Adjusted Gas Production With Outlier",
       y = "Seasonally Adjusted Gas",
       x = "Quarter")

In short: no - the significant effect of the manually introduced outlier can still be seen in the same light as when we introduced it at the beginning end of the gas data. The outlier totally distorts the seasonally adjusted data when using a multiplicative classical decomposition method.

3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

Let’s remind ourselves what the autoplot of myseries looked like.

autoplot(myseries)

Plot variable not specified, automatically selected `.vars = Turnover`

Now we’ll perform the x-11 decomposition.

x11_dcmp <- myseries |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components()
autoplot(x11_dcmp) +
  labs(title =
    "Decomposition of total AUS retail turnover using X-11.")

Wow! This definitely makes some elements stand out that were not noticeable previously. A few observations: Looking at the seasonal component, the pattern is pretty consistent over time, while the trend component shows there’s a steady increase but with some flatter periods (which go somewhat downward, though not enough to supersede the overall trend) like between ~2002 and 2012. The “flat” period is more between ~2002 and 2008, while between 2008 and 2012/2013 is the relatively minor downward period. We could call this a “slowing” of the trend which was not obvious previously. It’s followed again by an upward trend which reflects the overall movement of the trend-cycle. Another noteworthy observation i, the irregular component which seems to have several large spikes which indicates unexpected outliers that appear in the early to mid 1980’s. This was not at all obvious in the previous view of the data, which indicated there were not irregular/residuals at the start of the time series at all. In fact, it seemed like there was just a “general” upward trend. All of these observations were not noticeable with the box-cox lambda using the Guerrero method, either. This really hits home at the value of decomposition using x-11.

3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Figure 3.19: Decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Figure 3.20: Seasonal component from the decomposition shown in the previous figure.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The results of the STL decomposition show a notable trend: there is a steady increase over time of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995. The scale is similar to the original series based on the scale indicator, which further adds to the trend as a significant influence in the overall time series. The seasonal component shows a pattern which is consistent over tie, where March and December show the largest seasonal spikes in persons employed. However, the seasonal scale (-100 to 100) is a significantly smaller scale than the trend scale (6500 to 9000), which indicates that the seasonal patterns contribute but not nearly as significantly as the overall trend. When considering the remainder component, we notice the scale is from -250 to 50, a noticeably more ‘negative’ scale and also a small scale again compared to the trend. We also notice there is a significant dip in the early 1990s which leads us to our next question.

Is the recession of 1991/1992 visible in the estimated components?

Yes, the recession of 1991/1992 is clearly visible in the components, in particular in the remainder component. It is highly visible. We can also see it in the seasonal component from the decomposition, in Figure 3.20. In particular in the months of March, April, May, and July, there is a noticeable drop in the seasonal trends from the STL decomposition. I learned a lot through all these exercises and observations and look forward to utilizing time series decomposition further.

Loading the Data

3.1

3.2

United States GDP from global_economy.

Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

Victorian Electricity Demand from vic_elec.

Gas production from aus_production.

3.3

3.4

3.5

Tobacco from aus_production

Economy class passengers between Melbourne and Sydney from ansett

Pedestrian counts at Southern Cross Station from pedestrian

3.7