Homework 2

Author

Naomi Buell

library(tidyverse)
library(janitor)
library(fpp3)
library(USgas)
library(readxl)
library(feasts)
library(ggplot2)
library(scales)
library(latex2exp)
library(seasonal)
library(lubridate)

set.seed(9219)

3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

global_gdp_per_cap <- global_economy |>
  mutate(GDP_per_cap = GDP / Population / 1000) |>
  arrange(desc(GDP_per_cap)) |>
  filter(!(
    # Removing non-countries
    Code %in% c(
      "WLD",
      "IBT",
      "LMY",
      "MIC",
      "IBD",
      "EAR",
      "LMC",
      "UMC",
      "EAS",
      "LTE",
      "EAP",
      "TEA",
      "SAS",
      "TSA",
      "IDA",
      "HIC",
      "OED",
      "PST",
      "ECS",
      "NAC",
      "EUU",
      "EMU",
      "LCN",
      "TLA",
      "LAC",
      "TEC",
      "ECA",
      "MEA",
      "ARB"
    )
  )) |>
  select(Country, GDP_per_cap)

# Top per-capita GDPs all time
global_gdp_per_cap |> head()
# A tsibble: 6 x 3 [1Y]
# Key:       Country [2]
  Country       GDP_per_cap  Year
  <fct>               <dbl> <dbl>
1 Monaco               185.  2014
2 Monaco               181.  2008
3 Liechtenstein        179.  2014
4 Liechtenstein        174.  2013
5 Monaco               173.  2013
6 Monaco               168.  2016
# Top per-capita GDPs in 2017
global_gdp_per_cap |> filter(Year == 2017) |>  head()
# A tsibble: 6 x 3 [1Y]
# Key:       Country [6]
  Country          GDP_per_cap  Year
  <fct>                  <dbl> <dbl>
1 Luxembourg             104.   2017
2 Macao SAR, China        80.9  2017
3 Switzerland             80.2  2017
4 Norway                  75.5  2017
5 Iceland                 70.1  2017
6 Ireland                 69.3  2017
# Top country each year
global_gdp_summ <- global_gdp_per_cap |> 
  as_tibble() |>  
  group_by(Year) |> 
  slice_max(order_by = GDP_per_cap, n = 1, with_ties = FALSE) |> 
  ungroup()

global_gdp_summ |> head()
# A tibble: 6 × 3
  Country       GDP_per_cap  Year
  <fct>               <dbl> <dbl>
1 United States        3.01  1960
2 United States        3.07  1961
3 United States        3.24  1962
4 United States        3.37  1963
5 United States        3.57  1964
6 Kuwait               4.43  1965
# Plot
global_gdp_per_cap |> autoplot() +
  theme_classic() +
  theme(legend.position = "none") +
  labs(title = "GDP per capita by Country", x = "Year", y = "GDP per capita (1k 2019 USD)", subtitle = "Monaco has the highest GDP per capita, historically.")

Monaco has generally been the world leader in per-capita GDP throughout history. In the 60’s, the US had the highest per-capita GDP, but this is due to data missing for Monaco prior to 1970. Monaco had the highest GDP per capita of all time in 2014, with Liechtenstein in second, and Luxembourg in third that year. However, in the most recent year of data, 2017, Luxembourg is highest, with Macao SAR, China in second, and Switzerland in third. The UAE beat Monaco in ’76 and ’77, and Liechtenstein beat Monaco in 2013 and 2015.

3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

  • United States GDP from global_economy.

    # Plot
    global_economy |> filter(Country == "United States") |> mutate(GDP = GDP/1e12) |> 
      autoplot() +
      theme_classic() +
      theme(legend.position = "none") +
      labs(title = "U.S. GDP trends up, with slight dip during the Great Recession.", x = "Year", y = "GDP (trillions 2019 USD)")

    I transformed the units of the y-axis to trillions to make the large numbers easier to read.

  • Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

    bulls <- aus_livestock |> filter(Animal == "Bulls, bullocks and steers") |> select(-Animal) |> mutate(Count = Count/10000) |> filter(Count > 0)
    bulls |> head()
    # A tsibble: 6 x 3 [1M]
    # Key:       State [1]
         Month State                        Count
         <mth> <fct>                        <dbl>
    1 1976 Jul Australian Capital Territory  0.23
    2 1976 Aug Australian Capital Territory  0.21
    3 1976 Sep Australian Capital Territory  0.21
    4 1976 Oct Australian Capital Territory  0.19
    5 1976 Nov Australian Capital Territory  0.21
    6 1976 Dec Australian Capital Territory  0.18
    bulls |>
      autoplot() +
      theme_classic() +
      labs(title = "The Northern Territory leads in killing bulls for meat.", x = "Year", y = "10k bulls, bullocks, or steers slaughtered", subtitle = "There are distinct and consistent seasons of meat production.")

    I transformed the units of the y-axis to 10k bulls to make the large numbers easier to read.

  • Victorian Electricity Demand from vic_elec.

    # Aggregate to weekly data
    electricity_demand_weekly <- vic_elec |> 
      index_by(Week = yearweek(Time)) |> 
      summarize(Demand = sum(Demand) / 100000) |> 
      slice(-1, -n()) # Remove the first and last incomplete weeks showing misleadingly low totals
    
    # Plot weekly data
    electricity_demand_weekly |>
      autoplot() +
      theme_classic() +
      scale_x_yearweek(date_breaks = "2 month", date_labels = "%b %Y") +
      labs(title = "Weekly electricity demand in Victoria, Australia", x = "", y = "Total electricity demand (100k MWh)") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

    I transformed the data to remove weekly variations by transforming the y-axis to show weekly total demand (in 100k MWh). There is a consistent annual seasonal pattern, which drops during the December holiday, then spikes in January and February (summer), and remains elevated in June and July (winter).

  • Gas production from aus_production.

    # Plot un-transformed data:
    gas <- aus_production |> select(Gas)
    gas |>
      autoplot() +
      theme_classic() +
      labs(subtitle =  "Production and magnitude of seasonal variation start going up in the 70's.", 
           x = "", 
           y = "Gas production (petajoules)") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      scale_x_yearquarter(date_breaks = "2 year", date_labels = "%Y")

    I chose not to transform this data. Annual seasonality is evident, and the magnitude of its variability increases in the 70’s, along with the overall level of production. I considered transforming the data using the box-cox transformation (lambda = 0.11), to adjust the size of the seasonal variation to be about the same across the whole series, since originally, the variation that increased over the time. However, this made the plot more difficult to interpret.

3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

# Graph raw data
canadian_gas |> head()
# A tsibble: 6 x 2 [1M]
     Month Volume
     <mth>  <dbl>
1 1960 Jan   1.43
2 1960 Feb   1.31
3 1960 Mar   1.40
4 1960 Apr   1.17
5 1960 May   1.12
6 1960 Jun   1.01
canadian_gas |> autoplot() +
    labs(x = "",
         y = "Gas production (billions of cubic meters)", 
       title =  "Raw Monthly Canadian gas production") +
  theme_classic() 

# Determine appropriate lambda for box cox transformation
lambda <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

# Plot box cox transformed data
canadian_gas |>
  autoplot(box_cox(Volume, lambda)) +
  labs(y = "Gas production (billions of cubic meters)", 
       x = "",
       title = "Box-Cox transformed monthly Canadian gas production") +
  theme_classic()

The Box-Cox transformation is less helpful for this data series because the change in magnitude of the seasonality is not increasing or decreasing constantly with time. Instead, as you see in the graph of the raw data, seasonal variation increases in the late 70s and 80s, then goes back down in the 90s. The shape of the transformed data looks very similar to the raw, and the transformation doesn’t improve consistency in seasonality over time.

3.4

What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?

# get random retail series
random_id <- sample(aus_retail$`Series ID`, 1)
myseries <- aus_retail |>
  filter(`Series ID` == random_id)
industry_name <- myseries$Industry[1]

# Determine appropriate lambda for box cox transformation
lambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

# Plot box cox transformed data
myseries |>
  autoplot(box_cox(Turnover, lambda)) +
  labs(
    x = "",
    title = "Australian retail trade turnover",
    subtitle = industry_name,
    y = latex2exp::TeX(paste0(
      "Transformed monthly retail turnover ($\\lambda$ = ", round(lambda, 2), ")"
    ))) +
  theme_classic() +
  scale_x_yearmonth(breaks = "2 year", labels = label_date("%Y")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I would use the Box-Cox transformation with lambda = 0.18 for the cafes, restaurants and takeaway food services industry, based on the guerrero feature. The trend is increasing, with slowed turnover in the early 90’s and a slight drop in the early 2010’s.

3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

# Extract the series
Tobacco <- aus_production |> select(Tobacco)
Passengers <- ansett |> filter(Class == "Economy", Airports == "SYD-MEL" | Airports == "MEL-SYD") |> select(Passengers)
Count <- pedestrian |> filter(Sensor == "Southern Cross Station") |> select(Count)
series_list <- list(Tobacco, Passengers, Count)

# Loop through each series to print plot
for (s in series_list) {
  # Determine appropriate lambda for Box-Cox transformation
  lambda <- s |>
    features(!!sym(names(s)[1]), features = guerrero) |>
    pull(lambda_guerrero)
  print(
    paste0(
      round(lambda, 5),
      " is the lambda to stabilize the variance in ",
      names(s)[1] |> str_to_lower(),
      "."
    )
  )
  
  # Plot Box-Cox transformed data
  p <- s |> 
    autoplot(box_cox(!!sym(names(s)[1]), lambda)) +
    labs(
      title = latex2exp::TeX(
        paste0(
          "Transformed ",
          names(s)[1],
          " ($\\lambda$ = ",
          round(lambda, 2),
          ")"
        )
      )
    ) +
    theme_classic()
  print(p)
}
[1] "0.92646 is the lambda to stabilize the variance in tobacco."

[1] "1.99993 is the lambda to stabilize the variance in passengers."

[1] "-0.25016 is the lambda to stabilize the variance in count."

I determine the appropriate Box-Cox transformation parameter using the guererro feature, printing lambdas and visualizations of transformed series above.

3.7

  1. Consider the last five years of the Gas data from aus_production.

    gas <- tail(aus_production, 5*4) |> select(Gas)
    1. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
    gas |>
      autoplot() +
      theme_classic() +
      labs(subtitle = "Gas production is seasonal (low: Q4-Q1, high Q2-Q3), trending up.", x = "", y = "Gas production (petajoules)") +
      scale_x_yearquarter(breaks = unique(gas$Quarter), labels = label_date("%Y Q%q")) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

    Yes. Gas production appears seasonal (low in Q1 and Q4, high in Q2 and Q3). The trend cycle goes up over time.

    1. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
    gas_decomp <- gas |>
      model(classical_decomposition(Gas, type = "multiplicative")) |>
      components() 
    
    gas_decomp |> select(trend, seasonal) |> head(5*4)
    # A tsibble: 20 x 3 [1Q]
       trend seasonal Quarter
       <dbl>    <dbl>   <qtr>
     1   NA     1.13  2005 Q3
     2   NA     0.925 2005 Q4
     3  200.    0.875 2006 Q1
     4  204.    1.07  2006 Q2
     5  207     1.13  2006 Q3
     6  210.    0.925 2006 Q4
     7  213     0.875 2007 Q1
     8  216.    1.07  2007 Q2
     9  219.    1.13  2007 Q3
    10  219.    0.925 2007 Q4
    11  219.    0.875 2008 Q1
    12  219     1.07  2008 Q2
    13  219     1.13  2008 Q3
    14  220.    0.925 2008 Q4
    15  222.    0.875 2009 Q1
    16  223.    1.07  2009 Q2
    17  225.    1.13  2009 Q3
    18  226     0.925 2009 Q4
    19   NA     0.875 2010 Q1
    20   NA     1.07  2010 Q2
    gas_decomp |>
      autoplot() +
      labs(
        title = "Classical multiplicative decomp: Australian gas production",
        subtitle = "Gas production is seasonal (low: Q4-Q1, high Q2-Q3), trending up.",
        x = "",
        y = "Gas production (petajoules)"
      ) +
      theme_classic() +
      scale_x_yearquarter(breaks = unique(gas$Quarter), labels = label_date("%Y Q%q")) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

    I display the trend-cycle and seasonal indices in tabular form before graphing all components of the decomposition above.

    1. Do the results support the graphical interpretation from part a?

    Yes, the trend-cycle is trending upward, as interpreted in part a. The seasonality displayed is also consistent with part a (low in Q1 and Q4, high in Q2 and Q3).

    1. Compute and plot the seasonally adjusted data.
    gas_decomp |>
      select(season_adjust) |>
      autoplot() +
      labs(
        title = "Seasonally adjusted gas production",
        x = "",
        y = "Gas production (petajoules)"
      ) +
      theme_classic() +
      scale_x_yearquarter(breaks = unique(gas$Quarter), labels = label_date("%Y Q%q")) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

    1. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
    gas_outlier <- gas |>
      mutate(Gas = case_when(Quarter == yearquarter("2010 Q2") ~ Gas + 300, TRUE ~ Gas))
    
    gas_outlier |> 
      model(classical_decomposition(Gas, type = "multiplicative")) |>
      components()  |>
      select(season_adjust) |>
      autoplot() +
      labs(
        title = "Seasonally adjusted gas production, with outlier in Q2 of 2010.",
        x = "",
        y = "Gas production (petajoules)"
      ) +
      theme_classic() +
      scale_x_yearquarter(breaks = unique(gas$Quarter), labels = label_date("%Y Q%q")) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

    The outlier can be seen clearly in the seasonally-adjusted data, since it contains the remainder component (noise from outliers are not removed by seasonal adjustments–just the variation due to seasonality).

    1. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
gas_outlier_end <- gas |>
  mutate(Gas = case_when(Quarter == yearquarter("2008 Q1") ~ Gas + 300, TRUE ~ Gas))
  
gas_outlier_end |> 
  model(classical_decomposition(Gas, type = "multiplicative")) |>
  components()  |>
  select(season_adjust) |>
  autoplot() +
  labs(
    title = "Seasonally adjusted gas production, with outlier in Q1 of 2008.",
    subtitle = "Regardless if in the middle or end of the data, trend is skewed by outlier.",
    x = "",
    y = "Gas production (petajoules)"
  ) +
  theme_classic() +
  scale_x_yearquarter(breaks = unique(gas$Quarter), labels = label_date("%Y Q%q")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

No, it does not make any difference whether the outlier is at the end or in the middle of the data–data is still skewed by outliers, because the seasonally adjusted series contains the remainder component as well as the trend-cycle.

3.8

Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

myseries |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components() |> autoplot() +
  labs(
    title = "X-11 decomposition of Australian retail trade turnover",
    subtitle = industry_name) +
  theme_classic() +
  scale_x_yearmonth(breaks = "2 year", labels = label_date("%Y")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Previously, from the box-cox data, I noted the trend was increasing, with slowed turnover in the early 90’s and a slight drop in the early 2010’s. This observation is consistent with the trend-cycle graph depicted above. Something that I could only see clearly with this decomposed version is that the seasonality has relatively higher peaks in the 80’s and early 90s, then the height of peaks decrease in the mid-90’s, and seasonal troughs get lower in 2009 and onward.

3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Figure 3.19: Decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Seasonal component from the decomposition shown in the previous figure.

Figure 3.20: Seasonal component from the decomposition shown in the previous figure.

  1. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

    The trend in the Australian labor force goes upwards from 1978 to 1995. There is seasonality: January and August have the lowest counts, corresponding to holiday periods and reduced employment activity. There is a shock in 1991, which can be seen as a dip in the overall data values and a major low outlier in the remainder component (this aligns with the early 1990s recession in Australia). The magnitude of seasonal variation increases with time according to the season_year component graphed above.

  2. Is the recession of 1991/1992 visible in the estimated components?

    Yes, it is visible in the dip in the raw data, as well as the much larger dip in the remainder.