library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.2
## Registered S3 method overwritten by 'tsibble':
##   method               from
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble 1.1.6 ✔ feasts 0.4.1
## ✔ tsibbledata 0.4.1 ✔ fable 0.4.1
## Warning: package 'tsibble' was built under R version 4.4.2
## Warning: package 'tsibbledata' was built under R version 4.4.2
## Warning: package 'feasts' was built under R version 4.4.2
## Warning: package 'fabletools' was built under R version 4.4.2
## Warning: package 'fable' was built under R version 4.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tsibble)
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.4.2
##
## Attaching package: 'seasonal'
##
## The following object is masked from 'package:tibble':
##
## view
library(slider)
## Warning: package 'slider' was built under R version 4.4.2
# Introduction
In this homework assignment I will be submitting exercises 3.1 through 3.9 from Hyndman and Athanasopoulos's online Forecasting: Principles and Practice textbook.
# Exercise 3.1
Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
# Viewing the global_economy documentation
# help("global_economy")
# Computing GDP per capita and selecting that column (the tsibble index and key are kept automatically)
GDP <- global_economy %>%
  mutate(gdp_per_capita = GDP / Population) %>%
  select(gdp_per_capita)
# Viewing the data
head(GDP)
## # A tsibble: 6 x 3 [1Y]
## # Key: Country [1]
## gdp_per_capita Year Country
## <dbl> <dbl> <fct>
## 1 59.8 1960 Afghanistan
## 2 59.9 1961 Afghanistan
## 3 58.5 1962 Afghanistan
## 4 78.8 1963 Afghanistan
## 5 82.2 1964 Afghanistan
## 6 101. 1965 Afghanistan
# Removing the legend to see the plot of GDP per capita for each country over time
autoplot(GDP, gdp_per_capita, show.legend = FALSE)
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Finding the country with the highest GDP per capita in the most recent years of the data
# The three most recent years in the dataset (2015, 2016 and 2017)
recent_years <- unique(GDP$Year) %>% tail(3)
# Country with the highest GDP per capita in 2017
GDP %>%
  filter(Year == 2017) %>%
  arrange(desc(gdp_per_capita))
## # A tsibble: 262 x 3 [1Y]
## # Key: Country [262]
## gdp_per_capita Year Country
## <dbl> <dbl> <fct>
## 1 104103. 2017 Luxembourg
## 2 80893. 2017 Macao SAR, China
## 3 80190. 2017 Switzerland
## 4 75505. 2017 Norway
## 5 70057. 2017 Iceland
## 6 69331. 2017 Ireland
## 7 63249. 2017 Qatar
## 8 59532. 2017 United States
## 9 58070. 2017 North America
## 10 57714. 2017 Singapore
## # ℹ 252 more rows
# Country with the highest GDP per capita in 2016
GDP %>%
  filter(Year == 2016) %>%
  arrange(desc(gdp_per_capita))
## # A tsibble: 262 x 3 [1Y]
## # Key: Country [262]
## gdp_per_capita Year Country
## <dbl> <dbl> <fct>
## 1 168011. 2016 Monaco
## 2 164993. 2016 Liechtenstein
## 3 100739. 2016 Luxembourg
## 4 79866. 2016 Switzerland
## 5 78730. 2016 Isle of Man
## 6 74017. 2016 Macao SAR, China
## 7 70890. 2016 Norway
## 8 64100. 2016 Ireland
## 9 60530. 2016 Iceland
## 10 59044. 2016 Qatar
## # ℹ 252 more rows
# Country with the highest GDP per capita in 2015
GDP %>%
  filter(Year == 2015) %>%
  arrange(desc(gdp_per_capita))
## # A tsibble: 262 x 3 [1Y]
## # Key: Country [262]
## gdp_per_capita Year Country
## <dbl> <dbl> <fct>
## 1 167591. 2015 Liechtenstein
## 2 163369. 2015 Monaco
## 3 101447. 2015 Luxembourg
## 4 82016. 2015 Switzerland
## 5 81672. 2015 Isle of Man
## 6 75484. 2015 Macao SAR, China
## 7 74498. 2015 Norway
## 8 65177. 2015 Qatar
## 9 61808. 2015 Ireland
## 10 56561. 2015 Australia
## # ℹ 252 more rows
In the most recent year available (2017), Luxembourg had the highest GDP per capita, at roughly $104,103 USD. It ranked third in both 2016 and 2015, so its GDP per capita has been consistently high, at least in recent history. Macao SAR, China climbed from sixth place in 2015 and 2016 to second in 2017, whereas Monaco, previously among the strongest countries by GDP per capita, no longer appears in the top ten as of 2017. The three per-year queries above can also be collapsed into a single grouped query, as sketched below.
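A minimal sketch of that collapsed query, using the recent_years vector defined earlier; converting to a tibble first keeps dplyr's grouped slice_max() from being constrained by the tsibble index:
# Top GDP-per-capita entry for each of the three most recent years at once
GDP %>%
  as_tibble() %>%
  filter(Year %in% recent_years) %>%
  group_by(Year) %>%
  slice_max(gdp_per_capita, n = 1) %>%
  ungroup()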
# Exercise 3.2
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
- United States GDP from global_economy.
- Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock.
- Victorian Electricity Demand from vic_elec.
- Gas production from aus_production.
# Loading the US GDP data and adding GDP per capita as a population adjustment
us.gdp <- global_economy %>%
  mutate(gdp_per_capita = GDP / Population) %>%
  select(GDP, gdp_per_capita) %>%
  filter(Country == "United States")
head(us.gdp)
## # A tsibble: 6 x 4 [1Y]
## # Key: Country [1]
## GDP gdp_per_capita Year Country
## <dbl> <dbl> <dbl> <fct>
## 1 543300000000 3007. 1960 United States
## 2 563300000000 3067. 1961 United States
## 3 605100000000 3244. 1962 United States
## 4 638600000000 3375. 1963 United States
## 5 685800000000 3574. 1964 United States
## 6 743700000000 3828. 1965 United States
us.gdp %>% autoplot(GDP)
us.gdp %>% autoplot(gdp_per_capita)
Adjusting the US GDP data for population did not have a major impact on the visualization: both series show a similar strong upward trend. The comparison of the adjusted to the unadjusted data is still important, for this reason: if there were a material difference between the two plots, we could hypothesize different explanations and research them further. For example, if GDP per capita had a flat or decreasing trend while total GDP had a strong positive trend, we could explore the possibility of an increasing, younger population, or of greater returns for the same production due to political interplay, etc.
#help("aus_livestock")
# Loading in tsibble of bulls bullocks and steers slaughtered in Victoria
aus_livestock <- aus_livestock
livestock <- aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers",
State == "Victoria")
head(livestock)
## # A tsibble: 6 x 4 [1M]
## # Key: Animal, State [1]
## Month Animal State Count
## <mth> <fct> <fct> <dbl>
## 1 1976 Jul Bulls, bullocks and steers Victoria 109200
## 2 1976 Aug Bulls, bullocks and steers Victoria 94700
## 3 1976 Sep Bulls, bullocks and steers Victoria 95500
## 4 1976 Oct Bulls, bullocks and steers Victoria 94800
## 5 1976 Nov Bulls, bullocks and steers Victoria 94100
## 6 1976 Dec Bulls, bullocks and steers Victoria 98300
livestock %>%
  autoplot(Count)
No adjustment was made to the plot of the slaughter of bulls, bullocks and steers. However, a calendar adjustment could still be useful: dividing each monthly count by the number of days in that month gives an average slaughter count per day, removing the variation caused purely by months having different lengths. A sketch follows.
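A minimal sketch of that calendar adjustment, assuming lubridate's days_in_month() applied to the monthly index after conversion with as.Date():
# Dividing each monthly count by the length of that month gives an
# average daily slaughter count, removing month-length effects
livestock %>%
  mutate(avg_per_day = Count / days_in_month(as.Date(Month))) %>%
  autoplot(avg_per_day)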
#help("vic_elec")
vic_elec %>%
autoplot(Demand)
No transformation was applied to the demand plot from vic_elec. An aggregation of the half-hourly demand into daily demand may be appropriate depending on the direction of the analysis, but that aggregation would hide any within-day seasonality, which is only visible at sub-daily frequency. A sketch of the aggregation follows.
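A minimal sketch of that aggregation, using tsibble's index_by() with lubridate's as_date() to collapse the half-hourly index to days:
# Summing half-hourly demand into daily totals
vic_elec %>%
  index_by(Day = as_date(Time)) %>%
  summarise(Demand = sum(Demand)) %>%
  autoplot(Demand)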
# help("aus_production")
aus_production %>%
  autoplot(Gas)
# Applying a Box-Cox transformation because the variation of the series changes with its level
lambda <- aus_production |>
  features(Gas, features = guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Gas, lambda)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed gas production with $\\lambda$ = ",
         round(lambda, 2))))
This transformation stabilizes the variance over time, with the Guerrero feature selecting lambda = 0.11.
# Exercise 3.3
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
# Viewing the canadian_gas data over time
autoplot(canadian_gas, Volume)
The gas production data shows a variance that increases and then decreases over time, rather than growing steadily with the level of the series. A Box-Cox transformation is a good choice when the size of the variation changes proportionally with the level; here, standardizing the variance would essentially hide its rise and fall, which may itself be a feature of the data worth exploring. A quick check follows.
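A quick sketch to illustrate: even after transforming with the Guerrero-chosen lambda, the variance still rises and falls through the middle of the series, so the transformation buys little.
# Box-Cox with the Guerrero-selected lambda still leaves uneven variance
lambda_cg <- canadian_gas %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)
canadian_gas %>%
  autoplot(box_cox(Volume, lambda_cg))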
# Exercise 3.4
What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?
set.seed(123678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
# help("aus_retail")
myseries %>% autoplot(Turnover)
The plot of the retail data above shows variance increasing with the level of the series, so I will apply a Box-Cox transformation with lambda chosen by the Guerrero method to stabilize the variance.
lambda_retail <- myseries %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)
myseries %>%
  autoplot(box_cox(Turnover, lambda_retail)) +
  labs(title = latex2exp::TeX(paste0(
    "Box-Cox Transformation of Australian Retail Turnover with $\\lambda$ = ",
    round(lambda_retail, 2))))
The positive lambda reduces the right skewness of the data; a value near 0.5 corresponds roughly to a square-root transformation.
# Exercise 3.5
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
- Tobacco from aus_production.
- Economy class passengers between Melbourne and Sydney from ansett.
- Pedestrian counts at Southern Cross Station from pedestrian.
aus_production %>%
  autoplot(Tobacco) +
  labs(title = "Tobacco")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
lambda_tobacco <- aus_production %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)
aus_production %>%
  autoplot(box_cox(Tobacco, lambda_tobacco)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Tobacco with $\\lambda$ = ",
         round(lambda_tobacco, 2))))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
The Box-Cox transformation here uses lambda = 0.93; a lambda this close to 1 changes the shape of the series very little.
ansett %>%
  filter(Class == "Economy", Airports == "MEL-SYD") %>%
  mutate(Passengers = Passengers / 1000) %>%
  autoplot(Passengers)
lambda_class <- ansett %>%
  filter(Class == "Economy",
         Airports == "MEL-SYD") %>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)
ansett %>%
  filter(Class == "Economy",
         Airports == "MEL-SYD") %>%
  mutate(Passengers = Passengers / 1000) %>%
  autoplot(box_cox(Passengers, lambda = lambda_class)) +
  labs(y = "Passengers ('000)",
       title = latex2exp::TeX(paste0(
         "Transformed Ansett Airlines Economy Class with $\\lambda$ = ",
         round(lambda_class, 2))))
The Box-Cox transformation here uses lambda = 2.0, the upper limit of the Guerrero search range, corresponding to a squaring of the data.
pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  autoplot(Count) +
  labs(title = "Pedestrian Counts at Southern Cross Station")
lambda_count <- pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)
pedestrian %>%
  filter(Sensor == "Southern Cross Station") %>%
  autoplot(box_cox(Count, lambda_count)) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Pedestrian Counts at Southern Cross Station with $\\lambda$ = ",
         round(lambda_count, 2))))
The Box-Cox transformation here uses lambda = -0.23, which compresses the large peaks in the hourly counts.
# Exercise 3.6
Show that a 3 × 5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
# Showing the results of a 3x5-MA with the beer data from aus_production
beer <- aus_production |>
  filter(year(Quarter) >= 1992) |>
  select(Quarter, Beer)
beer_ma <- beer |>
  mutate(
    `5-MA` = slider::slide_dbl(Beer, mean,
                               .before = 2, .after = 2, .complete = TRUE),
    `3x5-MA` = slider::slide_dbl(`5-MA`, mean,
                                 .before = 1, .after = 1, .complete = TRUE)
  )
head(beer_ma,7)
## # A tsibble: 7 x 4 [1Q]
## Quarter Beer `5-MA` `3x5-MA`
## <qtr> <dbl> <dbl> <dbl>
## 1 1992 Q1 443 NA NA
## 2 1992 Q2 410 NA NA
## 3 1992 Q3 420 448. NA
## 4 1992 Q4 532 443. 445.
## 5 1993 Q1 433 443. 449.
## 6 1993 Q2 421 462. 450.
## 7 1993 Q3 410 445 447.
# Showing the results of a 7-term weighted moving average, with weights of
# 0.067, 0.133, 0.200, 0.200, 0.200, 0.133 and 0.067, on the same data used
# for the 3x5-MA demonstration above
weights <- c(0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067)
# Defining a function for the weighted moving average; windows shorter than
# the weight vector (at the series ends) return NA
weighted_ma <- function(x, weights) {
  if (length(x) < length(weights)) return(NA_real_)
  sum(x * weights)
}
# Applying the weighted moving average to the data
beer_weightedMA <- beer %>%
  mutate(ma_7 = slide_dbl(
    .x = Beer,
    .f = ~ weighted_ma(.x, weights),
    .before = 3,
    .after = 3
  ))
head(beer_weightedMA, 7)
## # A tsibble: 7 x 3 [1Q]
## Quarter Beer ma_7
## <qtr> <dbl> <dbl>
## 1 1992 Q1 443 NA
## 2 1992 Q2 410 NA
## 3 1992 Q3 420 NA
## 4 1992 Q4 532 445.
## 5 1993 Q1 433 449.
## 6 1993 Q2 421 450.
## 7 1993 Q3 410 447.
The results of beer_ma and beer_weightedMA shown above are essentially identical; the tiny differences come from rounding the weights to three decimal places, as derived below.
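Why the weights take these values: a 5-MA gives each of five neighbours weight 1/5, and the subsequent 3-MA averages three overlapping 5-MA values, so each observation's total weight is (number of the three 5-MA windows containing it) × (1/3) × (1/5). The window counts across the seven positions are 1, 2, 3, 3, 3, 2, 1, giving exact weights of 1/15, 2/15, 3/15, 3/15, 3/15, 2/15, 1/15. A one-line check:
# Exact 3x5-MA weights, rounded to three decimals to match the exercise
round(c(1, 2, 3, 3, 3, 2, 1) / 15, 3)
## Expected output: 0.067 0.133 0.200 0.200 0.200 0.133 0.067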
# Exercise 3.7
Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) |> select(Gas)
A. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
# Plotting the time series
autoplot(gas, Gas)
The trend-cycle appears to be a weak positive trend, with no cyclicity visible in the five years of data available. There is very strong seasonality, with peaks in Q2 and Q3 and troughs in Q1 and Q4; a seasonal plot (sketched below) makes this easy to confirm.
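A minimal sketch using feasts' gg_season(), which overlays each year's quarters:
# Each line is one year; the Q2/Q3 (Australian winter) peaks line up across years
gas %>% gg_season(Gas)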
B. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
# Computing the multiplicative classical decomposition components
gas.multiplicative <- gas %>%
  model(classical_decomposition(Gas, type = "multiplicative")) %>%
  components()
# Viewing the component plots
gas.multiplicative %>% autoplot()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
mean(gas.multiplicative$seasonal)
## [1] 1
The seasonal indices from the multiplicative decomposition average to 1 by construction; each quarter's index measures how far that quarter typically sits above or below the trend-cycle. The four indices themselves are sketched below.
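A small sketch to see the four quarterly indices; unique() recovers them because classical decomposition repeats the same index every year:
# The four quarterly seasonal indices (repeated across years)
unique(round(gas.multiplicative$seasonal, 3))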
mean(gas.multiplicative$trend, na.rm=TRUE)
## [1] 216.3203
The mean of the trend-cycle estimated by the multiplicative decomposition is about 216.32 petajoules.
C. Do the results support the graphical interpretation from part A?
Yes. The seasonal indices repeat with a one-year period, matching the strong quarterly seasonality seen in part A, and the estimated trend-cycle rises over the period, matching the weak upward trend in part A's plot.
D. Compute and plot the seasonally adjusted data.
# Plotting the seasonally adjusted data
gas %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = season_adjust))
E. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
# Adding 300 to the first observation and recomputing the seasonally adjusted data
gas300 <- gas
gas300$Gas[1] <- gas300$Gas[1] + 300
gas300 %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = season_adjust))
In the seasonally adjusted data the added outlier remains clearly visible as a spike. This is expected: seasonal adjustment removes only the seasonal component, so the outlier flows through to the remainder and therefore stays in the seasonally adjusted series.
F. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
# Adding 300 to observation 10 of 20 instead, i.e. the middle of the series
gas300 <- gas
gas300$Gas[10] <- gas300$Gas[10] + 300
gas300 %>%
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>%
  components() %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = season_adjust))
Yes, the position of the outlier matters. In part E the outlier sat at the very start of the series; here it sits in the middle (observation 10 of 20). With the outlier in the middle, the seasonally adjusted series is flattened and distorted across all years rather than showing a single clean spike. A plausible mechanism: classical decomposition estimates each quarter's seasonal index from the detrended values, and the moving-average trend is NA near the series ends, so an outlier near an end barely affects the indices, whereas one in the middle inflates them and so contaminates the adjustment in every year. A small sketch comparing positions follows.
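A minimal sketch comparing start, middle and end placements directly; adjust_with_outlier() is a hypothetical helper written for this check, not part of any package:
# Hypothetical helper: add 300 at position i and return the seasonally
# adjusted series from a multiplicative classical decomposition
adjust_with_outlier <- function(i) {
  g <- gas
  g$Gas[i] <- g$Gas[i] + 300
  g %>%
    model(classical_decomposition(Gas, type = "multiplicative")) %>%
    components() %>%
    as_tibble() %>%
    transmute(Quarter, position = paste0("obs ", i), season_adjust)
}
# Overlaying the three cases on one plot
bind_rows(lapply(c(1, 10, 20), adjust_with_outlier)) %>%
  ggplot(aes(x = Quarter, y = season_adjust, colour = position)) +
  geom_line() +
  labs(colour = "Outlier position")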
# Exercise 3.8
Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
myseries %>%
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) %>%
  components() %>%
  autoplot()
The X-11 decomposition reveals a spike in the irregular (remainder) component shortly before 1990, and a seasonal component whose shape changes over time, whereas I had previously assumed the seasonality was steadily increasing.
# Exercise 3.9
Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
At first glance, based on the panel scaling, I might have assumed that seasonality drives much of the movement in the value plot of the STL decomposition. Looking more closely at the scales, however, the seasonal fluctuations only span roughly -100 to 100, whereas the value ranges from about 6000 to 9000. The remainder is on roughly the same scale as the seasonal component, apart from one outlier. This means most of the change in the value is due to the trend-cycle.
The recession of 1991/1992 is visible as a large negative outlier in the remainder component of the plots. This makes sense because the recession was not driven by seasonality or the usual trend-cycle behaviour, but by other factors.