HW2 - Time series decompoistion

Exercises from Section 3 of Forecasting: Principles & Practice at https://otexts.com/fpp3/decomposition-exercises.html

Exercise 3.1

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

For the most recent year of 2016 Monaco has the highest GDP per capita with Liechtenstein just behind it. These two have been ahead of the pack since around 1985 when they broke out of the masses, with Monaco usually behind a fair ways above Liechtenstein.

ggplotly(
  autoplot(global_economy, GDP/Population) +
    theme(legend.position = "none"))

Exercise 3.2

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.

I’ve transformed the y-axis to show GDP per capita so the growth is within the context of the population. I don’t believe an inflation adjustment is needed as it appaears from the documentation that GDP has been adjusted to show with reference to February 2019 USD.

global_economy %>%
  filter(Country == "United States") %>%
  autoplot(GDP/Population) +
  labs(title= "GDP per capita", y = "$US")

Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.

For this data I’ve chosen a stacked bar chart so the overall trend is apparent but it’s also possible to roughly see the share each state had in the slaughter. I’ve also changed the y-axis to show the count of slaughters in thousands as I find it more easy to read than the scientific notation or large numbers.

aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers") %>%
    ggplot(aes(x = Month, y = (Count/1000), fill = State)) +
      geom_col() +
      labs(title="Slaughter of 'Bulls, bullocks and steers' in Australian States, by Year") +
      ylab('# Slaughted (by thousands)') +
      xlab('Month')

Victorian Electricity Demand from vic_elec.

Due to the variables in this dataset there are multiple ways to graph this but I chose to show daily electricity demand over time, with a color highlight for holiday days. I also chose to show demand in the hundreds, again to make the scale easier to interpret. Depending on your goal, this data could also be graphed showing time of day demand.

ggplotly(vic_elec %>%
    ggplot(aes(x = Date, y = (Demand/100), fill = Holiday)) +
      geom_col() +
      labs(title="Electricity Demand in Victoria by Day, Holidays in Teal") +
      ylab('Electricity Demand in MW (by hundreds)') +
      xlab('Day'))

Gas production from aus_production.

In reviewing section 3.1 of the textbook I use the guerrero feature they introduced to find a suitable lambda, one that makes the size of the seasonal variation about the same across the whole series. This lambda is passed into the box_cox function along with the Gas variable to make the following plot.

lambda <- aus_production %>%
  features(Gas, features = guerrero) %>%
  pull(lambda_guerrero)

ggplotly(aus_production %>%
    autoplot(box_cox(Gas, lambda)) +
     labs(y = "",
       title = "Transformed gas production with lambda = 0.12"))

Exercise 3.3

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

In looking at the raw data in the first plot, compared to the transformed data we see it wasn’t able to make the seasonal variation as uniform as we would hope. This appears to be because the affect of seasonality genuinely changed over the years, with particularly large seasonality difference from around 1075-1990.

canadian_gas %>%
    autoplot()

lambda <- canadian_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

canadian_gas %>%
    autoplot(box_cox(Volume, lambda)) +
     labs(y = "",
       title = "Monthly Canadian Gas Production, by billions of cubic metres",
       subtitle = "transformed with lambda = 0.39")

Exercise 3.4

What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?

Since we do see variation that increases as the years go on, it should follow that some transformation could be helpful.

set.seed(8675309)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

autoplot(myseries)

Using the guerrero feature to find a lambda on this dataset, the result is 0.24. This lambda falls the transformed data being simple a log of Turnover, or the square root of Turnover. As a log is more interpretable I would likely choose to transfer the data by logging the Turnover variable.

lambda <- myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)

lambda

## [1] 0.2426336

We do see more stability with the seasonal variation after plotting with the log transformation on the Turnover variable.

ggplot(myseries, aes(x = Month, y = (log(Turnover)))) +
  geom_line()

Exercise 3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

Tobacco from aus_production

With a lambda of 0.93, and a look at the raw data plotted, there is no transformation needed. A lambda of exactly 1.0 indicates that the time data perfectly predicts tobacco production, our lambda isn’t quite there but it’s somewhat close.

lambda <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

# original data
autoplot(aus_production, Tobacco)

Economy class passengers between Melbourne and Sydney from ansett

With a lambda of 2.0 (1.99999 rounded up) the appropriate transformation is y squared. The raw plot and the plot with this transformation are shown, and the transformed data appears to have more consistent variance across the weeks.

lambda <- ansett %>%
  filter(Airports == "MEL-SYD",
         Class == "Economy") %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

# original data
ansett %>%
  filter(Airports == "MEL-SYD",
         Class == "Economy") %>%
  mutate(Passengers = Passengers/1000) %>%
    autoplot() +
  labs(y = "Passengers (in thousands)",
       x = "Week",
       title = "# Economy Class Passengers between Melborune and Sydney, by Week")

# trasnformed via box_cox, a y-squared
ansett %>%
  filter(Airports == "MEL-SYD",
         Class == "Economy") %>%
  autoplot(box_cox(Passengers, lambda)) +
  labs(y = "Passengers (in thousands), transformed as y^2",
       x = "Week",
       title = "# Economy Class Passengers between Melborune and Sydney, by Week")

Pedestrian counts at Southern Cross Station from pedestrian

With a lambda of -0.23 it’s just slightly closer to a log transformation of y, as opposed to a 1 / square root of y as the most appropriate transformation. As a log is easier to interpret, that might be the best decision. From the plots below it appears the log transformation gives the lower Count of pedestrians more room to breath so the variance appears more equal.

lambda <- pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

# original data
 pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
    ggplot(aes(x = Date, y = Count)) +
    geom_line() +
    labs(y = "# of Pedestrians",
       x = "Day",
       title = "# Pedestrians at Sourthern Cross Station, by day")

# trasnformed via box_cox, a y-squared
pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  autoplot(box_cox(Count, lambda)) +
  labs(y = "# of Pedestrians, log transformation",
       x = "Day",
       title = "# Pedestrians at Sourthern Cross Station, by day")

Exercise 3.7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) %>% select(Gas)

Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

Yes Q1 appears to be the lowest and Q3 appears to be the high point in each season. Further from just a visual inspection there appears to be a slight increasing trend as both the lows and highs are getting higher.

gas %>%
  filter(Quarter >= make_yearquarter(year = 2006, quarter = 1)) %>%
  autoplot()

Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas %>%
  filter(Quarter >= make_yearquarter(year = 2006, quarter = 1)) %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
    ) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical multiplicative decomposition of Gas 2006-2010")

Do the results support the graphical interpretation from part a?

Yes, I believe so. I see the upward trend in this graphic, though it did stall out some around 2008. Further the season pattern is more more evident now that the trend and randomness have been removed from it. On the whole I think both graphics and interpretations agree.

Compute and plot the seasonally adjusted data.

After saving the decomposed information into gas_dcmp I plot the season_adjust line in orange against the original data. This shows what the data look like when the variation from the seasonality is removed by the decomposition process.

gas_dcmp <- gas %>%
  filter(Quarter >= make_yearquarter(year = 2006, quarter = 1)) %>%
  model(
    classical_decomposition(Gas, type = "multiplicative"))

components(gas_dcmp) %>%
  as_tsibble() %>%
  autoplot(Gas, colour="gray") +
  geom_line(aes(y=season_adjust), colour = "#D55E00")

Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

From this example it appears the seasonal adjustment is not very robust to outliers, as the orange line of the seasonally adjusted data spikes almost as high as the raw data.

#changing the 4th Gas value to 300
gas$Gas[4] <- 300

# perform decomposition
gas_dcmp <- gas %>%
  filter(Quarter >= make_yearquarter(year = 2006, quarter = 1)) %>%
  model(
    classical_decomposition(Gas, type = "multiplicative"))

#plot the seaon_adjust of the decomposition against original data
components(gas_dcmp) %>%
  as_tsibble() %>%
  autoplot(Gas, colour="gray") +
  geom_line(aes(y=season_adjust), colour = "#D55E00")

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

My first adjustment of an outlier above was the 4th datapoint in the series, below I place it closer to the middle at the 10th datapoint in the series. It appears to have even further affected the seasonally adjusted data, as there is much more variation before/after the outlier compared to the more flattened seasonally adjusted line seen above. An outlier near the middle of time series data is trouble when it comes ot seasonally adjusted data.

# pulling in original Gas data again 
gas <- tail(aus_production, 5*4) %>% select(Gas)

#changing the 10th Gas value to 300
gas$Gas[10] <- 300

# perform decomposition
gas_dcmp <- gas %>%
  filter(Quarter >= make_yearquarter(year = 2006, quarter = 1)) %>%
  model(
    classical_decomposition(Gas, type = "multiplicative"))

#plot the seaon_adjust of the decomposition against original data
components(gas_dcmp) %>%
  as_tsibble() %>%
  autoplot(Gas, colour="gray") +
  geom_line(aes(y=season_adjust), colour = "#D55E00")

Exercise 3.8

Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

With the decomposition plot, as compared to the plots trials in the last homework on this data, it’s easier to say more about the trend line. Previously I could say it seemed to be increasing, and now I can certainly say it’s increasing but not without some bumps and dips along the way, even after the seasonality and remainder has been removed per the decomposition. Further I can see there is a fair amount of variation in the amount of remainder/irregularity and can see which years those were.

#code to generate dataset the same as in HW1
set.seed(8675309)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))


#load necessary library
library(seasonal)

# perform decomposition
x11_dcmp <- myseries %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()
autoplot(x11_dcmp) +
  labs(title =
    "Decomposition of Monthly Australian Retail Turnover using X-11.")

Exercise 3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

(See image from online textbook.)

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

I see a clear, quite smooth increasing trend. Further, the seasonality is quite abrupt, there appears to be a strong drop than jump that repeats. The remainder appears to account for a large outlier around 1992, which looks like it even stalled the trend line’s increases a little.

With regard to the scales, I think it’s accurate to say that the seasonal affect on the data certainly exists, but is quite small. The outlier around 1992, while it doesn’t look that drastic on the raw plot, was actually quite out of the ordinary for this dataset because the rest appears quite predictable.

Is the recession of 1991/1992 visible in the estimated components?

Yes, very much so - it’s not as evident in the raw graphic but very visible in the remainder, and once I noticed that I associated it with the leveling out of the upward trend those same years.