All exercises can be found in “Forecasting: Principles and Practice” written by Rob J Hyndman and George Athanasopoulos.

Setup

The following packages are required to rerun this .Rmd file:

library(tidyverse)
library(fpp3)
library(USgas)

Exercise 2.1

Description

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

  • Use ? (or help()) to find out about the data in each series.
  • What is the time interval of each series?
  • Use autoplot() to produce a time plot of each series.
  • For the last plot, modify the axis labels and title.

Solution

Using the help() function, we get the following descriptions of each time series:

help(aus_production)

Description: Quarterly estimates of selected indicators of manufacturing production in Australia.

In particular, the Bricks column tracks the “clay brick production in millions of bricks.”

help(pelt)

Description: Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. This data contains trade records for all areas of the company.

In particular, the Lynx column tracks "The number of Canadian Lynx pelts traded."

help(gafa_stock)

Description: Historical stock prices from 2014-2018 for Google, Amazon, Facebook and Apple. All prices are in $USD.

In particular, the Close column tracks the “closing price for the stock.”

help(vic_elec)

Description: Half-hourly electricity demand for Victoria, Australia.

In particular, the Demand column tracks the “Total electricity demand in MWh.”

The time intervals for each series are as follows.

interval(aus_production)
## <interval[1]>
## [1] 1Q

This represents quarterly time intervals.

interval(pelt)
## <interval[1]>
## [1] 1Y

This represents yearly time intervals.

interval(vic_elec)
## <interval[1]>
## [1] 30m

This represents 30-minute time intervals.

interval(gafa_stock)
## <interval[1]>
## [1] !

Interestingly, gafa_stock does not appear to have a regular time interval. Upon closer inspection, this is because gafa_stock only has rows for days on which the stock market is open:

head(gafa_stock)
## # A tsibble: 6 x 8 [!]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close    Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
## 2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
## 3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
## 4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
## 5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
## 6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200

We see in the above output that the weekend of 2014-01-04 and 2014-01-05 was skipped. Thus, while the observations are nominally daily, the tsibble is irregular (hence the [!] above), since weekends and market holidays are skipped.
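If a regular index is ever needed, one option is to re-index by consecutive trading day; below is a sketch, where trading_day is an illustrative column name and not part of the dataset:

# re-index each stock by consecutive trading day to obtain a regular tsibble;
# "trading_day" is an illustrative name, not a column in gafa_stock
gafa_stock %>%
  group_by_key() %>%
  mutate(trading_day = row_number()) %>%
  ungroup() %>%
  update_tsibble(index = trading_day, regular = TRUE)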

We can use the autoplot function to create plots of each time series:

plt_data <- aus_production %>%
  select(Quarter, Bricks)

autoplot(plt_data, Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
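The warning above arises because the Bricks column contains missing values (they fall in the most recent quarters of the series); a quick count should match the 20 rows mentioned in the warning:

# count the missing Bricks values responsible for the warning
sum(is.na(aus_production$Bricks))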

The plot above shows a repeating seasonal pattern in Australia's quarterly brick production (likely less manufacturing during the summer, when workers take time off). In addition, after 1980 there appears to be an additional cyclic pattern in which brick production drops off dramatically and then slowly climbs back up.

plt_data <- pelt %>%
  select(Year, Lynx)

autoplot(plt_data, Lynx)

The graph above shows a clear repeating pattern in the number of lynx pelts traded in Canada during the late 19th and early 20th centuries. This might be due to a cycle of overhunting of lynx followed by periods in which hunters wait for the populations to recover via reproduction.

plt_data <- gafa_stock %>%
  select(Date, Close)

autoplot(plt_data, Close)

The plot above shows a readily recognizable stock market time series that tracks four major stocks over a period of five years. While there are definitely fluctuations in each stock price, the general trend is that these stocks rise over time.

plt_data <- vic_elec %>%
  select(Time, Demand)

autoplot(plt_data, Demand) + labs(
  x = "Year",
  y = 'Electricity Demand (MWh)',
  title = 'Victoria\'s Energy Demand Over Time'
)

The plot above shows a clear distinction between the power needs of Victorians in summer compared to winter, with power demand noticeably higher when it is hot out (likely due to A/C). However, there are also thinner, taller spikes in the coldest months of winter, when demand likely increases again due to the need for heat.
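The within-day seasonality is hidden at this five-year scale; one way to reveal it is to zoom in on a single month. Below is a sketch, where July 2014 is an arbitrary illustrative choice:

# plot one month of half-hourly demand to reveal the daily pattern;
# "2014 Jul" is an arbitrary illustrative choice
vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jul")) %>%
  autoplot(Demand)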

Exercise 2.2

Description

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

Solution

First, the cell below determines each of the Symbols in the gafa_stock dataset:

symbols <- gafa_stock %>% 
  distinct(Symbol)

symbols <- symbols$Symbol
symbols
## [1] "AAPL" "AMZN" "FB"   "GOOG"

Next, the cell below determines the date of the maximum closing price for each stock symbol:

# determine the name of the index column 
index_col_name <- index_var(gafa_stock)

# loop through each symbol
for(symbol in symbols){
  # create a tmp tsibble that contains only the rows for a specific symbol
  tmp <- gafa_stock %>%
    filter(Symbol == symbol)
  
  # determine the max closing price
  max_val <- max(tmp$Close)
  
  # filter the data to get the row that contains the max closing price
  peak_time <- tmp %>%
    filter(Close == max_val)
  
  # determine the date of the max closing price
  peak_time <- peak_time[[index_col_name]]
  
  # print the output
  print(paste("On", peak_time, symbol, "hit a max closing price of", max_val))
}
## [1] "On 2018-10-03 AAPL hit a max closing price of 232.070007"
## [1] "On 2018-09-04 AMZN hit a max closing price of 2039.51001"
## [1] "On 2018-07-25 FB hit a max closing price of 217.5"
## [1] "On 2018-07-26 GOOG hit a max closing price of 1268.329956"

Exercise 2.3

Description

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

  1. Read the data into R.
  2. Convert the data to time series.
  3. Construct time series plots of each of the three series.

Solution

The cell below reads in the data from the csv file into R:

tute <- readr::read_csv("data/tute1.csv")
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (3): Sales, AdBudget, GDP
## date (1): Quarter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tute)
## # A tibble: 6 × 4
##   Quarter    Sales AdBudget   GDP
##   <date>     <dbl>    <dbl> <dbl>
## 1 1981-03-01 1020.     659.  252.
## 2 1981-06-01  889.     589   291.
## 3 1981-09-01  795      512.  291.
## 4 1981-12-01 1004.     614.  292.
## 5 1982-03-01 1058.     647.  279.
## 6 1982-06-01  944.     602   254

The cell below converts tute into a tsibble object with Quarter as the index field. Checking the interval afterwards (see below) confirms that we have converted the data successfully:

tute <- tute %>%
  # use the Quarter field as the index field
  mutate(Quarter = yearquarter(Quarter)) %>%
    as_tsibble(index=Quarter)
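A quick sanity check on the interval, which should print as 1Q for a quarterly index:

# confirm the converted tsibble has a quarterly ([1Q]) interval
interval(tute)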

Finally, the cell below creates time series plots for each of the three series in the tsibble:

tute %>%
  # convert data into long format so it can be plotted
  pivot_longer(-Quarter) %>%
    # plot the data
    ggplot(aes(x=Quarter, y=value, color=name)) +
      geom_line()

The cell below creates the same plot as above, but this time uses a facet_grid so that each series has its own panel:

tute %>%
  # convert data into long format so it can be plotted
  pivot_longer(-Quarter) %>%
    # plot the data
    ggplot(aes(x=Quarter, y=value, color=name)) +
      geom_line() +
        facet_grid(name ~ ., scales = "free_y")

Exercise 2.4

Description

The USgas package contains data on the demand for natural gas in the US.

  1. Create a tsibble from us_total with year as the index and state as the key.
  2. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

Solution

The cell below creates a tsibble, us_gas, from us_total, showing the natural gas consumption by year for each state in the US.

us_gas <- us_total %>%
  as_tsibble(index=year,
             key=state)
head(us_gas)
## # A tsibble: 6 x 3 [1Y]
## # Key:       state [1]
##    year state        y
##   <int> <chr>    <int>
## 1  1997 Alabama 324158
## 2  1998 Alabama 329134
## 3  1999 Alabama 337270
## 4  2000 Alabama 353614
## 5  2001 Alabama 332693
## 6  2002 Alabama 379343

The cell below plots the gas consumption for the New England states (Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island):

NE_states <- c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts',
               'Connecticut', 'Rhode Island')

us_gas %>%
  filter(state %in% NE_states) %>%
    ggplot(aes(x=year, y=y, color=state)) +
      geom_line() +
        labs(title='Natural Gas Consumption in New England',
             y='Gas Consumption (Cubic Feet x 10^6)',
             x='Year')

Exercise 2.5

Description

  1. Download tourism.xlsx from the book website and read it into R using readxl::read_excel().
  2. Create a tsibble which is identical to the tourism tsibble from the tsibble package.
  3. Find what combination of Region and Purpose had the maximum number of overnight trips on average.
  4. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

Solution

The cell below reads the xlsx file into an R data frame:

tourism_xl <- readxl::read_excel('data/tourism.xlsx')
head(tourism_xl)
## # A tibble: 6 × 5
##   Quarter    Region   State           Purpose  Trips
##   <chr>      <chr>    <chr>           <chr>    <dbl>
## 1 1998-01-01 Adelaide South Australia Business  135.
## 2 1998-04-01 Adelaide South Australia Business  110.
## 3 1998-07-01 Adelaide South Australia Business  166.
## 4 1998-10-01 Adelaide South Australia Business  127.
## 5 1999-01-01 Adelaide South Australia Business  137.
## 6 1999-04-01 Adelaide South Australia Business  200.

Based on the output above, the data appears to show the number of overnight trips taken in each region of Australia, broken down by state and travel purpose.

The cell below converts the imported tourism_xl into a tsibble:

tourism_xl_ts <- tourism_xl %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
    as_tsibble(index=Quarter, key=c(Region, State, Purpose))

We can now compare tourism_xl_ts against the tourism tsibble from the tsibble package to ensure that they are the same:

all.equal(tourism_xl_ts, tourism)
## [1] TRUE

As we can see from the above, the two tsibbles do indeed contain the same information.

Next, the cell below determines which combination of Region and Purpose results in the largest average number of trips taken.

tourism_xl %>%
  # group by the specified columns 
  group_by(Region, Purpose) %>%
    # calculate the average number of trips
    summarise(avg_trips = mean(Trips, na.rm=TRUE)) %>%
      # order by the average number of trips
      arrange(desc(avg_trips)) %>%
        # ungroup so that we can correctly slice data
        ungroup() %>%
          # only take the first row, with the highest avg_trips amount
          slice(1)
## # A tibble: 1 × 3
##   Region Purpose  avg_trips
##   <chr>  <chr>        <dbl>
## 1 Sydney Visiting      747.

Based on the output above, the Sydney/Visiting combination represents the largest average number of trips over the time period included in tourism_xl.
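A more direct alternative to the arrange() plus slice() combination above is dplyr's slice_max(); a sketch of the same computation:

# slice_max() keeps the row with the largest avg_trips directly
tourism_xl %>%
  group_by(Region, Purpose) %>%
  summarise(avg_trips = mean(Trips, na.rm = TRUE), .groups = "drop") %>%
  slice_max(avg_trips, n = 1)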

Finally, the cell below creates a new time series that only shows the number of trips by state (removing Region and Purpose).

tourism_by_state <- tourism_xl %>%
  # group by specified values
  group_by(Quarter, State) %>%
    # calculate total number of trips by quarter for each state
    summarise(Trips = sum(Trips)) %>%
      # mutate quarter column to prep for tsibble conversion
      mutate(Quarter = yearquarter(Quarter)) %>%
        # convert to tsibble
        as_tsibble(index=Quarter,
                   key=State)

head(tourism_by_state)
## # A tsibble: 6 x 3 [1Q]
## # Key:       State [1]
## # Groups:    @ Quarter [6]
##   Quarter State Trips
##     <qtr> <chr> <dbl>
## 1 1998 Q1 ACT    551.
## 2 1998 Q2 ACT    416.
## 3 1998 Q3 ACT    436.
## 4 1998 Q4 ACT    450.
## 5 1999 Q1 ACT    379.
## 6 1999 Q2 ACT    558.
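Note that the "# Groups: @ Quarter" line above shows that the grouping from summarise() carried over into the tsibble; if that is unwanted, it can be dropped:

# drop the leftover index grouping carried over from summarise()
tourism_by_state <- tourism_by_state %>%
  ungroup()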

To confirm this has been done correctly, the cell below spot checks the first row (1998 Q1, ACT):

tmp <- tourism_xl %>% 
  filter(Quarter == '1998-01-01' & State == 'ACT')
sum(tmp$Trips)
## [1] 551.0019

This value does indeed match the first value seen in tourism_by_state.

Exercise 2.8

Description

Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: Employed from us_employment, Bricks from aus_production, Hare from pelt, Cost from PBS, and Barrels from us_gasoline.

  • Can you spot any seasonality, cyclicity and trend?
  • What do you learn about the series?
  • What can you say about the seasonal patterns?
  • Can you identify any unusual years?

Solution

gg_season(aus_production, y=Bricks, polar = TRUE)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).