Chapter 3, Hyndman and Athanasopoulos

Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

Remove rows where records have NA values in GDP, and plot the remaining records for all countries that recorded GDP over time.

Upon visual inspection of the graph above, Monaco has consistently the highest GDP per capita. However, since about 2010 Liechtenstein has been giving Monaco a run for the first place position, and in 2015 it briefy held the top spot.

ggplotly(
  autoplot(global_economy, GDP/Population, na.rm = TRUE) +
    theme(legend.position = "none"))

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

United States GDP from global_economy.
Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
Victorian Electricity Demand from vic_elec.
Gas production from aus_production.

The global_economy dataset was filtered for the United States GDP only, then the y-axis was transformed to scale in Billions of dollars.

US <-global_economy %>%
  filter(Country == "United States") %>%  # filter for just US GDP
    mutate(GDP = GDP / 1e9)               # divide Cost by 1B for y-axis scaling
  
autoplot(US,GDP) +
  labs(y = "GDP in $US (B)",
       title = "US GDP")

The aus_livestock dataset was filtered for “Bulls, bullocks and steers”, then the y-axis was transformed to scale in the thousands, and the graphics were individually scaled and facet wrapped for better readability within each Australian territory.

AU <-aus_livestock %>%
  filter(Animal == "Bulls, bullocks and steers") %>%  # filter for Bulls, bullocks and steers
    mutate(Count = Count / 1e3)               # divide Cost by 1K for y-axis scaling

autoplot(AU,Count) +
  labs(y = "Slaughtered ('000)",
       title = "Slaughter of Australian bulls, bullocks and steers") +
  theme(legend.position = "none") +
  facet_wrap(State ~ ., nrow = 2, ncol = 4, scales = "free_y", strip.position = "bottom")

The vic_elec dataset was simply graphed to show the trend and seasonality of electricity demand over time.

VIC <-vic_elec

ggplotly(autoplot(VIC,Demand) +
  labs(y = "Demand",
       title = "Victoria electricity demand"))

The aus_production dataset was transformed with Box-Cox by finding the optimal lambda of 0.1205 for the transformation. 0.1205 is closest to 0.0, which indicates a log transform in Box-Cox. Then the y-axis was log transformed with Box-Cox and plotted.

lambda <- aus_production %>%
  features(Gas, features = guerrero) %>%
  pull(lambda_guerrero)

aus_production %>%
autoplot(box_cox(Gas,lambda)) +
  labs(y = "",
       title = paste0("Transformed Australian gas production with lambda = ",round(lambda,4)))

Why is a Box-Cox transformation unhelpful for the canadian_gas data?

Box-Cox transformation is unhelpful for the canadian-gas data because, as seen below, the variation in the data shows “bookends” where the variation was less from 1960 to about 1975, then greater from about 1975 until about 1990, and then greatly diminished from 1990 to the present. Because the variation went from small to large to small, the transformations have little overall effect in ‘scaling’ the data that is only larger in the middle.

lambda <- canadian_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

canadian_gas %>%
  autoplot(box_cox(Volume,lambda)) +
  labs(y = "Cubic meters (B)",
       title = paste0("Monthly Canadian gas production with lambda = ",round(lambda,4)))

What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?

This refers to the aus_retail dataset showing retail turnover from 1982 to 2019. The original series is shown below, with an apparent increase in variation increasing from rougly 2008 onward.

myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

autoplot(myseries,Turnover) +
  labs(title = "Turnover by Month")

The Box-Cox transformation suggests that lambda is -0.1028, which is closest to a log transform, but since it is a negative lambda it would be a reciprocal log transform. The effect can be seen in the scale of the y-axis, and the seasonal variation does seem to be more uniformly distributed year-over-year.

set.seed(54321)
lambda <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) %>%
    features(Turnover, features = guerrero) %>%
    pull(lambda_guerrero)

aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) %>%
  autoplot(box_cox(Turnover,lambda)) +
  labs(y = "Turnover",
       title = paste0("Monthly Australian retail turnover with lambda = ",round(lambda,4)))

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

Tobacco from aus_production raw data looks like this, note the scale from approx 4000 to 8500 tonnes.

aus_production %>%
  autoplot(Tobacco) +
  labs(y = "",
       title = "Australian tobacco production")

The Box-Cox transformation shows a lambda of 0.9289, which suggests no transform is necessary, although there is a possibility that a square root transform could be applied since the lambda is less than 1.0. Plotting the transform with lambda of 0.9289 shows that the y-axis scale has been adjusted downward, albeit not by a square root method.

lambda <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)

aus_production %>%
  autoplot(box_cox(Tobacco,lambda)) +
  labs(y = "",
       title = paste0("'Transformed' Australian tobacco production with lambda = ",round(lambda,4)))

Economy class passengers between Melbourne and Sydney from ansett

Economy passenger volumes between Melbourne (MEL) and Sydney (SYD) on Ansett airlines raw data looks like this, note the scale from approx 0 to 33000 passengers.

ansett %>%
  filter(Class == "Economy", Airports == "MEL-SYD") %>%
    autoplot(Passengers) +
    labs(y = "# Passengers",
         title = "Ansett airlines Economy volumes MEL-SYD")

The Box-Cox transformation shows a lambda of 1.9999, which suggests a \(y^2\) transformation is necessary. Plotting the transform with lambda of 1.9999 shows the effect on the y-axis.

lambda <- ansett %>%
  filter(Class == "Economy", Airports == "MEL-SYD") %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)

ansett %>%
  filter(Class == "Economy", Airports == "MEL-SYD") %>%
  autoplot(box_cox(Passengers,lambda)) +
  labs(y = "",
       title = paste0("Transformed Ansett airlines Economy volumes MEL-SYD with lambda = ",round(lambda,4)))

Pedestrian counts at Southern Cross Station from pedestrian.

Pedestrian counts at Southern Cross Station raw data looks like this, note the scale from approx 0 to 3500 pedestrians.

pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
    autoplot(Count) +
    labs(y = "# Pedestrians.",
         title = "Hourly Pedestrian counts at Southern Cross Station")

The above graph is very hard to interpret with the total dataset “as is” so additional specificity is needed. Adding a field for Day of the week and setting it as an ordered factor Mon through Sun, the graph below is easier to interpret for Souther Cross Station pedestrian traffic by hour of the day (0 = midnight to 2300 = 11PM) faceted for each day of the week. It is possible to see that there are peak periods of the day during the morning and evening commutes during the week, as well as reduced traffic overall on Sat an Sun, with the exception of some outlying revelers Sat night into the wee hours of Sun morning. Note how the traffic volumes literally change the y-axis scales from weekday to weekend.

pedestrian <- pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
    mutate(Day = wday(Date, label = TRUE)) %>%
      mutate(Day = factor(Day, levels = levels(Day)[c(2:7, 1)]))

ggplot(pedestrian, aes(group = Time, x = Time, y = Count)) + 
  geom_boxplot() +
  facet_wrap(Day ~ ., nrow = 4, ncol = 2, scales = "free_y", strip.position = "bottom") +
  labs(title = "Southern Cross Station pedestrian traffic by hour each day", x = "Hour (2400 clock)")

The Box-Cox transformation shows a lambda of -0.2255 on the hourly by day Southern Cross Station data, which suggests a reciprocal log transformation. Plotting the transform with lambda of -0.2255 shows the variance more evenly distributed throughout each hour of each day.

lambda <- pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
    mutate(Day = wday(Date, label = TRUE)) %>%
      mutate(Day = factor(Day, levels = levels(Day)[c(2:7, 1)])) %>%
        features(Count, features = guerrero) %>%
        pull(lambda_guerrero)

pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
    mutate(Day = wday(Date, label = TRUE)) %>%
      mutate(Day = factor(Day, levels = levels(Day)[c(2:7, 1)])) %>%
        autoplot(box_cox(Count,lambda)) +
          labs(title = paste0("Hourly Pedestrian counts at Southern Cross Station with lambda = ",round(lambda,4))) +
            facet_wrap(Day ~ ., nrow = 4, ncol = 2, scales = "free_y", strip.position = "bottom")

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) %>% select(Gas)

Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

There are seasonal fluctuations in which Q1 starts low, Q2 ramps up, Q3 peaks and Q4 falls back off in terms of production. There is also an overall year-over-year upward trend showing each year starting higher than the prior year.

gas <- tail(aus_production, 5*4) %>% select(Gas)

autoplot(gas,Gas)

Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical multiplicative decomposition of Australian gas production 2006-2010")

Do the results support the graphical interpretation from part a?

Yes the results support the graphical interpretation of an increasing trend and the shape of the seasonal cycles matches what was inferred from the original graphic.

Compute and plot the seasonally adjusted data.

Saving the first part of the decomposition code above to a variable, it is possible to extract the components of the decomposition from the variable as a tsibble object and then plot the Gas data in blue against the seasonally adjusted trend in red. The seasonal adjustment is similar to but slightly different from the trend in the decomposition above, since the seasonal adjustment is the trend combined with the data in the random component of the decomposition.

gas_dcmp <- gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative"))

components(gas_dcmp) %>%
  as_tsibble() %>%
    autoplot(Gas, color = "blue") +
    geom_line(aes(y=season_adjust), color = "red")

Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

Adding 300 to the 2007Q1 value and copying down the code to compute the seasonally adjusted data causes both the original data in blue and the seasonally adjusted line in red to spike simultaneously at the instance of the outlier.

gas$Gas[7] <- 300

gas_dcmp <- gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative"))

components(gas_dcmp) %>%
  as_tsibble() %>%
    autoplot(Gas, color = "blue") +
    geom_line(aes(y=season_adjust), color = "red")

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Having the outlier near the end rather than in the middle of the time series does not appear to make much difference. Both the original blue and seasonally adjusted red lines spike in either location.

gas <- tail(aus_production, 5*4) %>% select(Gas)  # reset the data

gas$Gas[18] <- 300

gas_dcmp <- gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative"))

components(gas_dcmp) %>%
  as_tsibble() %>%
    autoplot(Gas, color = "blue") +
    geom_line(aes(y=season_adjust), color = "red")

Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

This again refers to the aus_retail dataset showing retail turnover from 1982 to 2019. The original series is shown once more below.

myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

autoplot(myseries,Turnover) +
  labs(title = "Turnover by Month")

Decomposing using the X-11 method improves upon the classical method above in terms of showing finer detail in the trend, seasonal and irregular components of the decomposition overall. The trend shows more jagged turns. Also there is a scalloped pattern at the top of the seasonal peaks which is typical of seasonality, and the irregularites such as the one seen just after Jan 2000 are localized to the time period in which they occurred, rather than allowing the randomness to bleed into other time periods as the classical decomposition appears to allow.

retail_dcmp <- myseries %>%
  model(
    x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components()

autoplot(retail_dcmp) +
  labs(title = "Decomposition of Australian retail turnover using X-11")

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The Australian civilian labor force has seen continuous growth from Feb 1978 to Aug 1995 as evidenced by the steadily increasing trend line in the decomposition. The variance in annual seasonal cycles continues to increase year over year as seen in the season_year section, showing slightly lower drops in the bottom of the cycle and higher peaks nearer to the top of each cycle. The remainder of the data indicates random peaks and dips, with a particularly prominent dip in 1991/1992. The scale of the trend matches that of the total value of the labor force data. However, the scale of the seasonality is smaller than the remainder of the data, indicating that seasonality plays a very small role in the overall decomposition.

Is the recession of 1991/1992 visible in the estimated components?

The recession of 1991/1992 is visible in the estimated components, specifically in the remainder section as noted in the above comments. It is also reflected both in the drop in the total value of the labor data and the pleateau in the trend during the same period.

Data 624 Predictive Analytics

Douglas Barley

2/12/2022

Chapter 3, Hyndman and Athanasopoulos