All exercises can be found in “Forecasting: Principles and Practice” written by Rob J Hyndman and George Athanasopoulos.
The following packages are required to rerun this .rmd
file:
library(tidyverse)
library(fpp3)
library(USgas)
Explore the following four time series: Bricks from
aus_production, Lynx from pelt,
Close from gafa_stock, Demand
from vic_elec.
? (or help()) to find out about the
data in each series.autoplot() to produce a time plot of each
series.Using the help() function, we get the following
descriptions of each time series:
help(aus_production)
Description: Quarterly estimates of selected indicators of manufacturing production in Australia.
In particular, the Bricks column tracks the “clay brick
production in millions of bricks.”
help(pelt)
Description: Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. This data contains trade records for all areas of the company.
In particular, the Lynx column tracks the “The number of
Canadian Lynx pelts traded.”
help(gafa_stock)
Description: Historical stock prices from 2014-2018 for Google, Amazon, Facebook and Apple. All prices are in $USD.
In particular, the Close column tracks the “closing
price for the stock.”
help(vic_elec)
Description: Half-hourly electricity demand for Victoria, Australia.
In particular, the Demand column tracks the “Total
electricity demand in MWh.”
The time intervals for each series are as follows.
interval(aus_production)
## <interval[1]>
## [1] 1Q
This represents quarterly time intervals.
interval(pelt)
## <interval[1]>
## [1] 1Y
This represents yearly time intervals.
interval(vic_elec)
## <interval[1]>
## [1] 30m
This represents 30 minute time intervals.
interval(gafa_stock)
## <interval[1]>
## [1] !
Interestingly, gafa_stock does not appear to have a time
interval. Upon closer inspection, this is due to the fact that
gafa_stock only has rows for days in which the stock market
is open:
head(gafa_stock)
## # A tsibble: 6 x 8 [!]
## # Key: Symbol [1]
## Symbol Date Open High Low Close Adj_Close Volume
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 58671200
## 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 98116900
## 3 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 103152700
## 4 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 79302300
## 5 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 64632400
## 6 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 69787200
We see in the above output that the weekend of 2014-01-04 and
2014-01-05 was skipped. Thus while the interval of this
tsibble is technically daily, it skips a decent number of
days per year.
We can use the autoplot function to create plots of each
time series:
plt_data <- aus_production %>%
select(Quarter, Bricks)
autoplot(plt_data, Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
The plot above shows a repeating seasonality trend in Austrailia’s yearly brick productions (likely less manufacturing during the summer when workers take off). In addition, after 1980 there appears to be an additional repeating pattern in which brick production drops off dramatically and then slowly crawls back up.
plt_data <- pelt %>%
select(Year, Lynx)
autoplot(plt_data, Lynx)
The graph above shows that there is a definite repeating trend in the number of Lynx pelts traded in Canada during the late late 19th and early 20th centuries. This might be due to a cycle of overhunting of Lynx followed by periods in which hunters wait for their populations to climb via reproduction.
plt_data <- gafa_stock %>%
select(Date, Close)
autoplot(plt_data, Close)
The plot above shows a pretty recognizable stock market time series that tracks four major stocks over a period of five years. While there are defintiely flucuations in each stock price, the general trend seems to be that these stocks rise over time.
plt_data <- vic_elec %>%
select(Time, Demand)
autoplot(plt_data, Demand) + labs(
x = "Year",
y = 'Electricity Demand (MWh)',
title = 'Austrailia\'s Energy Demand by Year'
)
The plot above shows a clear distinction between the power needs of Australians in the summer compared to in the winter, with power demand noticeably higher when it is hot out (likely due to A/C). However, there are also thinner, taller spikes in the coldest months of the winter when demand likely increases again due to the need for heat.
Use filter() to find what days corresponded to the peak
closing price for each of the four stocks in
gafa_stock.
First, the cell below determines each of the Symbols in the
gafa_stock dataset:
symbols <- gafa_stock %>%
distinct(Symbol)
symbols <- symbols$Symbol
symbols
## [1] "AAPL" "AMZN" "FB" "GOOG"
Next, the cell below determines the date of the maximum closing price for each stock symbol:
# determine the name of the index column
index_col_name <- index_var(gafa_stock)
# loop through each symbol
for(symbol in symbols){
# create a tmp tsibble that contains only the rows for a specific symbol
tmp <- gafa_stock %>%
filter(Symbol == symbol)
# determine the max closing price
max_val <- max(tmp$Close)
# filter the data to get the row that contains the max closing price
peak_time <- tmp %>%
filter(Close == max_val)
# determine the date of the max closing price
peak_time <- peak_time[[index_col_name]]
# print the output
print(paste("On", peak_time, symbol, "hit a max closing price of", max_val))
}
## [1] "On 2018-10-03 AAPL hit a max closing price of 232.070007"
## [1] "On 2018-09-04 AMZN hit a max closing price of 2039.51001"
## [1] "On 2018-07-25 FB hit a max closing price of 217.5"
## [1] "On 2018-07-26 GOOG hit a max closing price of 1268.329956"
Download the file tute1.csv from the book
website, open it in Excel (or some other spreadsheet application), and
review its contents. You should find four columns of information.
Columns B through D each contain a quarterly series, labelled
Sales, AdBudget and GDP.
Sales contains the quarterly sales for a small company over
the period 1981-2005. AdBudget is the advertising budget
and GDP is the gross domestic product. All series have been
adjusted for inflation.
The cell below reads in the data from the csv file into
R:
tute <- readr::read_csv("data/tute1.csv")
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Sales, AdBudget, GDP
## date (1): Quarter
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tute)
## # A tibble: 6 × 4
## Quarter Sales AdBudget GDP
## <date> <dbl> <dbl> <dbl>
## 1 1981-03-01 1020. 659. 252.
## 2 1981-06-01 889. 589 291.
## 3 1981-09-01 795 512. 291.
## 4 1981-12-01 1004. 614. 292.
## 5 1982-03-01 1058. 647. 279.
## 6 1982-06-01 944. 602 254
The cell below converts tute into a tsibble
object with Quarter as the field index. The
[1Q] in the output shows that we have converted the data
sucessfully:
tute <- tute %>%
# use the Quarter field as the index field
mutate(Quarter = yearquarter(Quarter)) %>%
as_tsibble(index=Quarter)
Finally, the cell below creates time series plots for each of the
three series in the tsibble:
tute %>%
# convert data into long format so it can be plotted
pivot_longer(-Quarter) %>%
# plot the data
ggplot(aes(x=Quarter, y=value, color=name)) +
geom_line()
The cell below creates the same plot as above, but this time uses a
facet_grid so that each column has its own plot:
tute %>%
# convert data into long format so it can be plotted
pivot_longer(-Quarter) %>%
# plot the data
ggplot(aes(x=Quarter, y=value, color=name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
The USgas package contains data on the demand for
natural gas in the US.
tsibble from us_total with
year as the index and state as the key.The cell below creates a tsibble from
us_gas, showing the natural gas consumption by year for
each state in the US.
us_gas <- us_total %>%
as_tsibble(index=year,
key=state)
head(us_gas)
## # A tsibble: 6 x 3 [1Y]
## # Key: state [1]
## year state y
## <int> <chr> <int>
## 1 1997 Alabama 324158
## 2 1998 Alabama 329134
## 3 1999 Alabama 337270
## 4 2000 Alabama 353614
## 5 2001 Alabama 332693
## 6 2002 Alabama 379343
The cell below plots the gas consumption for the New England states (Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island):
NE_states <- c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts',
'Connecticut', 'Rhode Island')
us_gas %>%
filter(state %in% NE_states) %>%
ggplot(aes(x=year, y=y, color=state)) +
geom_line() +
labs(title='Natural Gas Consumption in New England',
y='Gas Consumption (Cubic Feet x 10^6)',
x='Year')
tourism.xlsx from the
book website and read it into R using
readxl::read_excel().tsibble which is identical to the tourism
tsibble from the tsibble package.tsibble which combines the Purposes and
Regions, and just has total trips by State.The cell below reads in the xlsx file as a R
dataframe:
tourism_xl <- readxl::read_excel('data/tourism.xlsx')
head(tourism_xl)
## # A tibble: 6 × 5
## Quarter Region State Purpose Trips
## <chr> <chr> <chr> <chr> <dbl>
## 1 1998-01-01 Adelaide South Australia Business 135.
## 2 1998-04-01 Adelaide South Australia Business 110.
## 3 1998-07-01 Adelaide South Australia Business 166.
## 4 1998-10-01 Adelaide South Australia Business 127.
## 5 1999-01-01 Adelaide South Australia Business 137.
## 6 1999-04-01 Adelaide South Australia Business 200.
Based on the output above, the data appears to show the number of trips taken in each state of Australia.
The cell below converts the imported tourism_xl into a
tsibble:
tourism_xl_ts <- tourism_xl %>%
mutate(Quarter = yearquarter(Quarter)) %>%
as_tsibble(index=Quarter, key=c(Region, State, Purpose))
We can now compare tourism_xl against the
tourism tsibble from the ffp3
package to ensure that they are the same:
all.equal(tourism_xl_ts, tourism)
## [1] TRUE
As we can see from the above, the two tsibbles do indeed
contain the same information.
Next, the cell below determines which combination of
Region and Purpose result in the largest
average number of trips taken.
tourism_xl %>%
# group by the specified columns
group_by(Region, Purpose) %>%
# calculate the average number of trips
summarise(avg_trips = mean(Trips, na.rm=TRUE)) %>%
# order by the average number of trips
arrange(desc(avg_trips)) %>%
# ungroup so that we can correctly slice data
ungroup() %>%
# only take the first row, with the highest avg_trips amount
slice(1)
## # A tibble: 1 × 3
## Region Purpose avg_trips
## <chr> <chr> <dbl>
## 1 Sydney Visiting 747.
Based on the output above, those visiting Sydney represents the
largest number of average trips over the time period included in
tourism_xl.
Finally, the cell below creates a new time series that only shows the number of trips by state (removing Region and Reason).
tourism_by_state <- tourism_xl %>%
# group by specified values
group_by(Quarter, State) %>%
# calculate total number of trips by quarter for each state
summarise(Trips = sum(Trips)) %>%
# mutate quarter column to prep for tsibble conversion
mutate(Quarter = yearquarter(Quarter)) %>%
# convert to tsibble
as_tsibble(index=Quarter,
key=State)
head(tourism_by_state)
## # A tsibble: 6 x 3 [1Q]
## # Key: State [1]
## # Groups: @ Quarter [6]
## Quarter State Trips
## <qtr> <chr> <dbl>
## 1 1998 Q1 ACT 551.
## 2 1998 Q2 ACT 416.
## 3 1998 Q3 ACT 436.
## 4 1998 Q4 ACT 450.
## 5 1999 Q1 ACT 379.
## 6 1999 Q2 ACT 558.
To confirm this has been done correctly, the cell below spot checks the first row (1998 Q1, ACT):
tmp <- tourism_xl %>%
filter(Quarter == '1998-01-01' & State == 'ACT')
sum(tmp$Trips)
## [1] 551.0019
This value does indeed match the first value seen in
tourism_by_state.
Use the following graphics functions: autoplot(),
gg_season(), gg_subseries(),
gg_lag(), ACF() and explore features from the
following time series: Employed from
us_employment, Bricks from
aus_production, Hare from pelt,
Cost from PBS, and Barrels from
us_gasoline.
gg_season(aus_production, y=Bricks, polar = TRUE)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).