Exercises from Hyndman and Athanosopoulos, Forecasting: Principles and Practice, 3rd Edition, Chapter 2 Time series graphics.
library(fpp3)Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent.
autoplot() to plot some of the series in these data sets.We examine the documentation of each dataset using ?<dataset-name> The interval command informs us of the interval mapped to each tsibble object.
The command ?gafa_stock informs us this dataset contains historical stock prices for four technology companies Google, Amazon, Facebook and Apple for 2014-2018. We note that Adj_Close is the best quantitative price series to use because it mitigates the effects of dividends and stock splits. That is, the adjustment is required to alow valid comparison of performance of a stock across long periods of time when corporate actions like dividends or stock splits may occur.
autoplot(gafa_stock, Adj_Close) +
labs(title="Stock prices from 2014-2018 in $USD",
subtitle="Google, Amazon, Facebook, Apple" ,
y = "Price in $USD"
)The R command below shows that the interval of gafa_stock is irregular.
interval(gafa_stock)## <interval[1]>
## [1] !
PBS is a tsibble of monthly Medicare Australian prescription data. The index is the column Month and the key is 4 tuple of columns: Concession, Type, ATC1, and ATC2. The chart below shows 4 time series associated with PBS dataset associated with 4 combinations of attributes related to Concession and Type. We plot the number of Scripts below for Analgesics.
We observe an increasing trend and annual seasonality in all 4 time series. Note that the interval is 1M. We also note that seasonal variation increases over the years.
interval(PBS)## <interval[1]>
## [1] 1M
PBS %>% filter( ATC2_desc == "ANALGESICS" ) %>%
select(Month, Concession, Type, ATC1, ATC1_desc,
ATC2, ATC2_desc, Cost, Scripts) -> Analgesics
Analgesics %>% autoplot( Scripts ) +
labs(title="Analgesics Prescriptions Volume",
subtitle="AT2: N02", y = "Number of Scripts" )Analgesics %>% autoplot( Cost ) +
labs(title="Analgesics Prescriptions Cost",
subtitle="ATC2: N02", y = "Cost of Scripts" )vic_elec represents half-hourly demand for electricity in Victoria, Australia. The tsibble interval is 30min. There is no key column. So the Time column uniquely identifies each observation. The numeric columns of interest are Demand (in MW) and Temperature in (Celsius).
We autoplot the demand and temperature as time series. Both series appear to show no trend but daily and annual seasonality. There may be cyclicality but I cannot see it by visual inspection.
interval(vic_elec)## <interval[1]>
## [1] 30m
vic_elec %>% select( Time, Demand) %>% autoplot(Demand) +
labs(title="Half-Hourly Demand for Electric",
subtitle = "Victoria, Australia",
y = "MegaWatts")vic_elec %>% select( Time, Temperature) %>% autoplot(Temperature) +
labs(title="Half-Hourly Temperature",
subtitle = "Victoria, Australia",
y = "Celsius")pelt represents annual pelt trading for two species: Hare and Lynx from 1845 to 1935. The interval is annual. There is no key so each row is uniquely identified by year. However, we pivot the tsibble to a longer format by adding a species and count column to make the data easier to plot.
We observe cyclicality in the pelt count which may indicate a prey-predator relation that may cause Hare and Lynx populations to oscillate. A high count of lynxes is usually followed by a drop-off in Hare populations in later years. We don’t observe seasonality because the oscillations don’t follow a fixed period of time.
interval(pelt)## <interval[1]>
## [1] 1Y
pivot_longer(pelt, !Year, names_to="species", values_to = 'count' ) %>% autoplot() +
labs(title="Pelts Trade by Species",
subtitle = "Hudson Bay Company Records" ,
y = "Number of Pelts traded" )## Plot variable not specified, automatically selected `.vars = count`
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
gafa_stock %>% group_by(Symbol ) %>%
mutate( max_price = max( Adj_Close)) %>%
filter( Adj_Close == max_price) %>%
select( Symbol, Date, Close, Adj_Close, Volume )## # A tsibble: 4 x 5 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date Close Adj_Close Volume
## <chr> <date> <dbl> <dbl> <dbl>
## 1 AAPL 2018-10-03 232. 230. 28654800
## 2 AMZN 2018-09-04 2040. 2040. 5721100
## 3 FB 2018-07-25 218. 218. 58954200
## 4 GOOG 2018-07-26 1268. 1268. 2405600
The days of peak closing price should be in terms of the adjusted close - not the nominal close – to be economically accurate. However, this makes little practical difference for this dataset. The peak price dates for AAPL is 2018-10-03, for AMZN 2018-09-04, etc, as shown in tsibble output above.
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
tute1 <- readr::read_csv("tute1.csv")
View(tute1)mytimeseries <- tute1 %>%
mutate(Quarter = yearmonth(Quarter)) %>%
as_tsibble(index = Quarter)mytimeseries %>%
pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")Check what happens when you don’t include facet_grid().
The code chunks in the above problem statement can be used to create a tsibble object with the required data.
tute1 <- readr::read_csv("tute1.csv")##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Quarter = col_date(format = ""),
## Sales = col_double(),
## AdBudget = col_double(),
## GDP = col_double()
## )
mytimeseries <- tute1 %>%
mutate(Quarter = yearmonth(Quarter)) %>%
as_tsibble(index = Quarter)mytimeseries %>%
pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y") In the plot below, we illustrate what happens when we don’t use the
facet_grid command.
mytimeseries %>%
pivot_longer(-Quarter) %>%
ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() We see that all three timeseries are displayed on the same numerical scale. The problem is that the units of measurement for Sales and AdBudget are probably in dollar terms, while the GDP is probably an index level (not a dollar figure). Thus, we suspect it is invalid to compare the y-values of GDP to Sales or AdBudget.
The USgas package contains data on the demand for natural gas in the US.
USgas package.us_total with year as the index and state as the key.After installing the USgas package in RStudio from the command line with the install.packages command, we will create a tsibble from the us_total dataset as required.
library(USgas)
ts_us_total = as_tsibble(us_total, index = year, key = state )The state column represents either a US state, territory or the entire country. There are 53 entries including District of Columbia, Federal Offshore – Gulf of Mexico and entire country. The numeric column y is yearly total natural gas consumption in millions of cubic feet.
Using autoplot we obtain the line chart below. But notice that Massachusetts and Connecticut consume much more natural gas than Vermont and New Hampshire.
ts_us_total %>%
filter(state %in%
c("Maine", "Vermont", "New Hampshire",
"Massachusetts", "Connecticut", "Rhode Island")) -> newengland
newengland %>% autoplot(y) + labs(title="Annual Natural Gas Consumption - New England",
subtitle="1997-2019",
y = "Millions Cubic Ft") While we see trends clearly in the higher consuming states, the graph is too flat to make inferences. We transform the y axis to a logarithmic (base 10) scale to better compare consumption trends.
newengland %>% ggplot(aes(x=year,y=y, color = state)) + geom_line() +
scale_y_log10() +
labs(title = "Annual Natural Gas Consumption - New England",
subtitle = "Log 10 scale: 1997-2019",
y = "Log Millions of Cubic feet") We observe a slight positive trend in Massachusetts, Vermont, Connecticut. A jump transition for Maine and New Hampshire that warrants further inquiry. Rhode Island appears to follow a “U” shape trend.
tourism.xlsx from \(\color{red}{\text{the book website}}\) and read it into R using readxl::read_excel().tourism tsibble from the tsibble package.Region and Purpose had the maximum number of overnight trips on average.Part a. After downloading the Excel file, we read it into R using readxl::read_excel to an object tourismxl.
Part b. We transform the quarter text field to quarterly interval type using the yearquarter function in tsibble. To construct the tsibble, we specify as a key the 3 columns Region, State, and Purpose and name the tsibble as tourism2.
tourismxl = readxl::read_excel("tourism.xlsx")
tourismxl %>% mutate( Quarter = yearquarter(Quarter)) %>%
as_tsibble(key=c(Region, State, Purpose), index = Quarter) -> tourism2
tourism2## # A tsibble: 24,320 x 5 [1Q]
## # Key: Region, State, Purpose [304]
## Quarter Region State Purpose Trips
## <qtr> <chr> <chr> <chr> <dbl>
## 1 1998 Q1 Adelaide South Australia Business 135.
## 2 1998 Q2 Adelaide South Australia Business 110.
## 3 1998 Q3 Adelaide South Australia Business 166.
## 4 1998 Q4 Adelaide South Australia Business 127.
## 5 1999 Q1 Adelaide South Australia Business 137.
## 6 1999 Q2 Adelaide South Australia Business 200.
## 7 1999 Q3 Adelaide South Australia Business 169.
## 8 1999 Q4 Adelaide South Australia Business 134.
## 9 2000 Q1 Adelaide South Australia Business 154.
## 10 2000 Q2 Adelaide South Australia Business 169.
## # … with 24,310 more rows
We show it is equivalent to the tsibble tourism in the tsibbledata package using the all.equal() function.
# tourism and tourism2 are equal
all.equal(tourism, tourism2)## [1] TRUE
Part c. Normal dplyr verbs behave differently for tibble and tsibble objects. In order to aggregate time series values over time (Quarter), Region and Purpose, we convert the tsibble back to a regular tibble.
# Remove the tsibble attributes of the dataset so we can aggregate over time
tourism %>% as_tibble() -> att
# Aggregate over Quarter and State.
# Calculate average number of trips over all Quarters and States within each Region and Purpose
# Arrange in descending over by average trip count
# Select only top 5
# Display with kable
att %>% group_by(Region, Purpose) %>% summarize(meanTrips = mean(Trips)) %>%
arrange(desc(meanTrips)) %>% head() ## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: Region [4]
## Region Purpose meanTrips
## <chr> <chr> <dbl>
## 1 Sydney Visiting 747.
## 2 Melbourne Visiting 619.
## 3 Sydney Business 602.
## 4 North Coast NSW Holiday 588.
## 5 Sydney Holiday 550.
## 6 Gold Coast Holiday 528.
We conclude that Sydney-Visiting is the tourism Region and Purpose with the greatest average number of trips - 747 thousand trips.
Part d.
The following code creates a new tsibble called tourism3 which combines Purpose and Region. The key is State and index is still Quarter. Note that we ungroup() to flatten the resulting summary and allow conversion to a tsibble(). A few rows of the resulting tsibble are showed.
att %>% group_by(Quarter, State) %>% summarize(Trips = sum(Trips)) %>%
ungroup() %>% as_tsibble(index=Quarter, key=State) -> tourism3## `summarise()` has grouped output by 'Quarter'. You can override using the `.groups` argument.
tourism3## # A tsibble: 640 x 3 [1Q]
## # Key: State [8]
## Quarter State Trips
## <qtr> <chr> <dbl>
## 1 1998 Q1 ACT 551.
## 2 1998 Q2 ACT 416.
## 3 1998 Q3 ACT 436.
## 4 1998 Q4 ACT 450.
## 5 1999 Q1 ACT 379.
## 6 1999 Q2 ACT 558.
## 7 1999 Q3 ACT 449.
## 8 1999 Q4 ACT 595.
## 9 2000 Q1 ACT 600.
## 10 2000 Q2 ACT 557.
## # … with 630 more rows
Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):
set.seed(12345678)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))Explore your chosen retail time series using the following functions:
autoplot(), gg_season(), gg_subseries(), gg_lag(),
ACF() %>% autoplot()
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
Using our own seed, we analyze the Pharmaceutical, cosmetic and toiletry good retail sector for Western Australia below. A few rows of the data are shown below for Western Australia below. The Turnover is measured in $Million AUD. We will first present some diagnostic plots followed by interpretations of the results. Western Australia is a vast region with population concentrated in the city of Perth on the coast.
set.seed(392381)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
head(myseries)## # A tsibble: 6 x 5 [1M]
## # Key: State, Industry [1]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X 1982 Apr 14.6
## 2 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X 1982 May 15.2
## 3 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X 1982 Jun 14.5
## 4 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X 1982 Jul 14.6
## 5 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X 1982 Aug 15.3
## 6 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X 1982 Sep 15.1
The line plot below overlays the turnover timeseries with a red trendline constructed using the STL decomposition from the fpp3 package. We observe an upward trend from 1982-2005 followed by an increased trend from 2005-2019.
dcmp <- myseries %>% model(stl= STL(Turnover))
stl_info = components(dcmp) %>% as_tsibble()
autoplot(stl_info, Turnover) +
labs(title="Retail Turnover of Pharmaceutical, cosmetic, toiletry goods",
subtitle="Western Australia",
y = "AUD$ Millions") + geom_line(aes(y=trend), color="red")Seasonality is broken down by year using gg_season below.
gg_season(myseries, Turnover ) The subseries by month shows how Turnover has increased for each month from 1982 - 2018.
gg_subseries( myseries, Turnover)The lag plot is shown below with 12 lagging periods consistent with the monthly observations to detect annual patterns.
gg_lag(myseries, Turnover, geom="point", lags = 1:12) +
labs(title="12 lags of Retail Turnover",
subtitle="Western Australia - Pharma, cosmetic, toiletry") We show the autocorrelation plot of the Turnover data below.
myseries %>% ACF(Turnover, demean=TRUE) %>%
autoplot() + labs(title="Pharmaceuticals, cosmetic and toiletry") Lastly, we add an extra chart to extract the trend, seasonality and residual pattern of this time series.
components(dcmp) %>% autoplot() + labs(title="STL decomposition - Retail Turnover")This timeseries exhibits an upward trend that accelerates after 2005 to 2018. In the trend chart of the STL decomposition, we also notice a decline in the trend in 2010 and 2012. These are consistent with real GDP declines for Australia during those periods. Unlike other countries (like the US), Australia did not experience a recession during 2009. This is reflected in the retail sales data too - there is no 2008 dip.
There is a significant seasonal effect in sales. In December, sales are generally highest. In February, generally lowest. March, May, August and October are generally strong months. April, June, September and November are weaker months.
The lag chart is generally uninformative in comparison to the timeseries and seasonality charts.
The subseries charts shows that a consistent pattern. For each calendar month, the Turnover has increased each year following the same trend.
Autocorrelations are very positive because they are dominated by the trend effect. So this chart is not very informative.
In the STL decomposition, we see increasing variance in the seasonal and residual time series components. There are 2 possible explanations for this growth in my opinion. First, the dollar amounts are not inflation or GDP adjusted. They are raw dollar volumes of sales. We know that Western Australia chief economic output is minerals and mining (extractive commodities) hence subject to global inflation pressures. Second, the dollar amounts are not populated adjusted.
GDP of Australia Perth, the main population center of Western Australia, had rapid population growth in 2008-2016. According to Wikipedia, Perth grew annually by 3.03% in 2008, 2.84% in 2010.
Lastly, the statistical methodology to compute Turnover was revised by the Australia Bureau of Statistics in July 2000 to include goods and services tax. I expect this to have a small positive impact on the trend growth.
Lastly, we cannot disentangle the sales data of Pharmaceuticals from toiletries and cosmetics. From Section 2.4 of FPP3, the authors argue that Australian consumers “stockpile before the end of the calendar year” for anti-diabetes drugs. These effects are likely to be tied to annual drug price increases controlled by the government. However, prices of toiletries and cosmetics prices are less regulated for than drugs.