Exercise 2.1

Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent.

Use autoplot() to plot some of the series in these data sets.
What is the time interval of each series?

Solution

We examine the documentation of each dataset using ?<dataset-name> The interval command informs us of the interval mapped to each tsibble object.

gafa_stock

The command ?gafa_stock informs us this dataset contains historical stock prices for four technology companies Google, Amazon, Facebook and Apple for 2014-2018. We note that Adj_Close is the best quantitative price series to use because it mitigates the effects of dividends and stock splits. That is, the adjustment is required to alow valid comparison of performance of a stock across long periods of time when corporate actions like dividends or stock splits may occur.

autoplot(gafa_stock, Adj_Close) + 
  labs(title="Stock prices from 2014-2018 in $USD",
       subtitle="Google, Amazon, Facebook, Apple" ,
        y = "Price in $USD"
                                    )

The R command below shows that the interval of gafa_stock is irregular.

interval(gafa_stock)

## <interval[1]>
## [1] !

PBS

PBS is a tsibble of monthly Medicare Australian prescription data. The index is the column Month and the key is 4 tuple of columns: Concession, Type, ATC1, and ATC2. The chart below shows 4 time series associated with PBS dataset associated with 4 combinations of attributes related to Concession and Type. We plot the number of Scripts below for Analgesics.

We observe an increasing trend and annual seasonality in all 4 time series. Note that the interval is 1M. We also note that seasonal variation increases over the years.

interval(PBS)

## <interval[1]>
## [1] 1M

PBS %>% filter( ATC2_desc == "ANALGESICS" ) %>% 
  select(Month, Concession, Type, ATC1, ATC1_desc, 
         ATC2, ATC2_desc, Cost, Scripts) -> Analgesics

Analgesics %>% autoplot( Scripts ) + 
  labs(title="Analgesics Prescriptions Volume", 
       subtitle="AT2: N02", y = "Number of Scripts" )

Analgesics %>% autoplot( Cost ) + 
  labs(title="Analgesics Prescriptions Cost", 
       subtitle="ATC2: N02", y = "Cost of Scripts" )

vic_elec

vic_elec represents half-hourly demand for electricity in Victoria, Australia. The tsibble interval is 30min. There is no key column. So the Time column uniquely identifies each observation. The numeric columns of interest are Demand (in MW) and Temperature in (Celsius).

We autoplot the demand and temperature as time series. Both series appear to show no trend but daily and annual seasonality. There may be cyclicality but I cannot see it by visual inspection.

interval(vic_elec)

## <interval[1]>
## [1] 30m

vic_elec %>% select( Time, Demand) %>% autoplot(Demand) + 
  labs(title="Half-Hourly Demand for Electric",
  subtitle = "Victoria, Australia", 
  y = "MegaWatts")

vic_elec %>% select( Time, Temperature) %>% autoplot(Temperature) + 
  labs(title="Half-Hourly Temperature",
      subtitle = "Victoria, Australia", 
      y = "Celsius")

pelt

pelt represents annual pelt trading for two species: Hare and Lynx from 1845 to 1935. The interval is annual. There is no key so each row is uniquely identified by year. However, we pivot the tsibble to a longer format by adding a species and count column to make the data easier to plot.

We observe cyclicality in the pelt count which may indicate a prey-predator relation that may cause Hare and Lynx populations to oscillate. A high count of lynxes is usually followed by a drop-off in Hare populations in later years. We don’t observe seasonality because the oscillations don’t follow a fixed period of time.

interval(pelt)

## <interval[1]>
## [1] 1Y

pivot_longer(pelt, !Year, names_to="species", values_to = 'count' ) %>% autoplot() + 
  labs(title="Pelts Trade by Species", 
       subtitle = "Hudson Bay Company Records" , 
       y = "Number of Pelts traded" )

## Plot variable not specified, automatically selected `.vars = count`

Exercise 2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

Solution

gafa_stock %>% group_by(Symbol ) %>% 
  mutate( max_price = max( Adj_Close)) %>% 
  filter( Adj_Close == max_price) %>% 
  select( Symbol, Date, Close, Adj_Close, Volume )

## # A tsibble: 4 x 5 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date       Close Adj_Close   Volume
##   <chr>  <date>     <dbl>     <dbl>    <dbl>
## 1 AAPL   2018-10-03  232.      230. 28654800
## 2 AMZN   2018-09-04 2040.     2040.  5721100
## 3 FB     2018-07-25  218.      218. 58954200
## 4 GOOG   2018-07-26 1268.     1268.  2405600

The days of peak closing price should be in terms of the adjusted close - not the nominal close – to be economically accurate. However, this makes little practical difference for this dataset. The peak price dates for AAPL is 2018-10-03, for AMZN 2018-09-04, etc, as shown in tsibble output above.

Exercise 2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

You can read the data into R with the following script:

tute1 <- readr::read_csv("tute1.csv")
View(tute1)

Convert the data to time series

mytimeseries <- tute1 %>%
  mutate(Quarter = yearmonth(Quarter)) %>%
  as_tsibble(index = Quarter)

Construct time series plots of each of the three series

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid().

Solution

The code chunks in the above problem statement can be used to create a tsibble object with the required data.

tute1 <- readr::read_csv("tute1.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Quarter = col_date(format = ""),
##   Sales = col_double(),
##   AdBudget = col_double(),
##   GDP = col_double()
## )

mytimeseries <- tute1 %>%
  mutate(Quarter = yearmonth(Quarter)) %>%
  as_tsibble(index = Quarter)

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

In the plot below, we illustrate what happens when we don’t use the facet_grid command.

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

We see that all three timeseries are displayed on the same numerical scale. The problem is that the units of measurement for Sales and AdBudget are probably in dollar terms, while the GDP is probably an index level (not a dollar figure). Thus, we suspect it is invalid to compare the y-values of GDP to Sales or AdBudget.

Exercise 2.4

The USgas package contains data on the demand for natural gas in the US.

Install the USgas package.
Create a tsibble from us_total with year as the index and state as the key.
Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

Solution

After installing the USgas package in RStudio from the command line with the install.packages command, we will create a tsibble from the us_total dataset as required.

library(USgas)

ts_us_total = as_tsibble(us_total, index = year, key = state )

The state column represents either a US state, territory or the entire country. There are 53 entries including District of Columbia, Federal Offshore – Gulf of Mexico and entire country. The numeric column y is yearly total natural gas consumption in millions of cubic feet.

Using autoplot we obtain the line chart below. But notice that Massachusetts and Connecticut consume much more natural gas than Vermont and New Hampshire.

ts_us_total %>% 
  filter(state %in% 
     c("Maine", "Vermont", "New Hampshire",
       "Massachusetts", "Connecticut", "Rhode Island")) -> newengland

newengland %>% autoplot(y) + labs(title="Annual Natural Gas Consumption - New England",
                                  subtitle="1997-2019",
                                  y = "Millions Cubic Ft")

While we see trends clearly in the higher consuming states, the graph is too flat to make inferences. We transform the y axis to a logarithmic (base 10) scale to better compare consumption trends.

newengland %>% ggplot(aes(x=year,y=y, color = state)) + geom_line() + 
  scale_y_log10() + 
  labs(title = "Annual Natural Gas Consumption - New England", 
        subtitle = "Log 10 scale: 1997-2019",
        y = "Log Millions of Cubic feet")

We observe a slight positive trend in Massachusetts, Vermont, Connecticut. A jump transition for Maine and New Hampshire that warrants further inquiry. Rhode Island appears to follow a “U” shape trend.

Exercise 2.5

Download tourism.xlsx from $\color{red}{\text{the book website}}$ and read it into R using readxl::read_excel().
Create a tsibble which is identical to the tourism tsibble from the tsibble package.
Find what combination of Region and Purpose had the maximum number of overnight trips on average.
Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

Solution

Part a. After downloading the Excel file, we read it into R using readxl::read_excel to an object tourismxl.

Part b. We transform the quarter text field to quarterly interval type using the yearquarter function in tsibble. To construct the tsibble, we specify as a key the 3 columns Region, State, and Purpose and name the tsibble as tourism2.

tourismxl = readxl::read_excel("tourism.xlsx")

tourismxl %>% mutate( Quarter = yearquarter(Quarter)) %>% 
        as_tsibble(key=c(Region, State, Purpose), index = Quarter) -> tourism2



tourism2

## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # … with 24,310 more rows

We show it is equivalent to the tsibble tourism in the tsibbledata package using the all.equal() function.

# tourism and tourism2 are equal
all.equal(tourism, tourism2)

## [1] TRUE

Part c. Normal dplyr verbs behave differently for tibble and tsibble objects. In order to aggregate time series values over time (Quarter), Region and Purpose, we convert the tsibble back to a regular tibble.

# Remove the tsibble attributes of the dataset so we can aggregate over time
tourism %>% as_tibble() -> att  

# Aggregate over Quarter and State.  
# Calculate average  number of trips over all Quarters and States within each Region and Purpose
# Arrange in descending over by average trip count
# Select only top 5 
# Display with kable
att %>% group_by(Region, Purpose) %>% summarize(meanTrips = mean(Trips)) %>%
   arrange(desc(meanTrips)) %>% head()

## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.

## # A tibble: 6 x 3
## # Groups:   Region [4]
##   Region          Purpose  meanTrips
##   <chr>           <chr>        <dbl>
## 1 Sydney          Visiting      747.
## 2 Melbourne       Visiting      619.
## 3 Sydney          Business      602.
## 4 North Coast NSW Holiday       588.
## 5 Sydney          Holiday       550.
## 6 Gold Coast      Holiday       528.

We conclude that Sydney-Visiting is the tourism Region and Purpose with the greatest average number of trips - 747 thousand trips.

Part d.

The following code creates a new tsibble called tourism3 which combines Purpose and Region. The key is State and index is still Quarter. Note that we ungroup() to flatten the resulting summary and allow conversion to a tsibble(). A few rows of the resulting tsibble are showed.

att %>% group_by(Quarter, State) %>% summarize(Trips = sum(Trips)) %>% 
   ungroup() %>% as_tsibble(index=Quarter, key=State) -> tourism3

## `summarise()` has grouped output by 'Quarter'. You can override using the `.groups` argument.

tourism3

## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
##    Quarter State Trips
##      <qtr> <chr> <dbl>
##  1 1998 Q1 ACT    551.
##  2 1998 Q2 ACT    416.
##  3 1998 Q3 ACT    436.
##  4 1998 Q4 ACT    450.
##  5 1999 Q1 ACT    379.
##  6 1999 Q2 ACT    558.
##  7 1999 Q3 ACT    449.
##  8 1999 Q4 ACT    595.
##  9 2000 Q1 ACT    600.
## 10 2000 Q2 ACT    557.
## # … with 630 more rows

Exercise 2.8

Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):

set.seed(12345678)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

Explore your chosen retail time series using the following functions:

autoplot(), gg_season(), gg_subseries(), gg_lag(),

ACF() %>% autoplot()

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

Solution

Using our own seed, we analyze the Pharmaceutical, cosmetic and toiletry good retail sector for Western Australia below. A few rows of the data are shown below for Western Australia below. The Turnover is measured in $Million AUD. We will first present some diagnostic plots followed by interpretations of the results. Western Australia is a vast region with population concentrated in the city of Perth on the coast.

set.seed(392381)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

head(myseries)

## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State         Industry                           `Series ID`    Month Turnover
##   <chr>         <chr>                              <chr>          <mth>    <dbl>
## 1 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X   1982 Apr     14.6
## 2 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X   1982 May     15.2
## 3 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X   1982 Jun     14.5
## 4 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X   1982 Jul     14.6
## 5 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X   1982 Aug     15.3
## 6 Western Aust… Pharmaceutical, cosmetic and toil… A3349581X   1982 Sep     15.1

The line plot below overlays the turnover timeseries with a red trendline constructed using the STL decomposition from the fpp3 package. We observe an upward trend from 1982-2005 followed by an increased trend from 2005-2019.

dcmp <- myseries %>% model(stl= STL(Turnover))

stl_info = components(dcmp) %>% as_tsibble()

autoplot(stl_info, Turnover) + 
  labs(title="Retail Turnover of Pharmaceutical, cosmetic, toiletry goods",
      subtitle="Western Australia",
      y = "AUD$ Millions")  + geom_line(aes(y=trend), color="red")

Seasonality is broken down by year using gg_season below.

gg_season(myseries, Turnover )

The subseries by month shows how Turnover has increased for each month from 1982 - 2018.

gg_subseries( myseries, Turnover)

The lag plot is shown below with 12 lagging periods consistent with the monthly observations to detect annual patterns.

gg_lag(myseries, Turnover, geom="point", lags  = 1:12) + 
  labs(title="12 lags of Retail Turnover", 
       subtitle="Western Australia - Pharma, cosmetic, toiletry")

We show the autocorrelation plot of the Turnover data below.

myseries %>% ACF(Turnover, demean=TRUE) %>% 
  autoplot() + labs(title="Pharmaceuticals, cosmetic and toiletry")

Lastly, we add an extra chart to extract the trend, seasonality and residual pattern of this time series.

components(dcmp) %>% autoplot() + labs(title="STL decomposition - Retail Turnover")

Interpretation of plots

This timeseries exhibits an upward trend that accelerates after 2005 to 2018. In the trend chart of the STL decomposition, we also notice a decline in the trend in 2010 and 2012. These are consistent with real GDP declines for Australia during those periods. Unlike other countries (like the US), Australia did not experience a recession during 2009. This is reflected in the retail sales data too - there is no 2008 dip.
There is a significant seasonal effect in sales. In December, sales are generally highest. In February, generally lowest. March, May, August and October are generally strong months. April, June, September and November are weaker months.
The lag chart is generally uninformative in comparison to the timeseries and seasonality charts.
The subseries charts shows that a consistent pattern. For each calendar month, the Turnover has increased each year following the same trend.
Autocorrelations are very positive because they are dominated by the trend effect. So this chart is not very informative.
In the STL decomposition, we see increasing variance in the seasonal and residual time series components. There are 2 possible explanations for this growth in my opinion. First, the dollar amounts are not inflation or GDP adjusted. They are raw dollar volumes of sales. We know that Western Australia chief economic output is minerals and mining (extractive commodities) hence subject to global inflation pressures. Second, the dollar amounts are not populated adjusted.
GDP of Australia Perth, the main population center of Western Australia, had rapid population growth in 2008-2016. According to Wikipedia, Perth grew annually by 3.03% in 2008, 2.84% in 2010.
Lastly, the statistical methodology to compute Turnover was revised by the Australia Bureau of Statistics in July 2000 to include goods and services tax. I expect this to have a small positive impact on the trend growth.
Lastly, we cannot disentangle the sales data of Pharmaceuticals from toiletries and cosmetics. From Section 2.4 of FPP3, the authors argue that Australian consumers “stockpile before the end of the calendar year” for anti-diabetes drugs. These effects are likely to be tied to annual drug price increases controlled by the government. However, prices of toiletries and cosmetics prices are less regulated for than drugs.

Data 624 HW 1 (Week 2) Time Series

Alexander Ng

09/12/2021

Overview

Exercise 2.1

Solution

gafa_stock

PBS

vic_elec

pelt

Exercise 2.2

Solution

Exercise 2.3

Solution

Exercise 2.4

Solution

Exercise 2.5

Solution

Exercise 2.8

Solution

Interpretation of plots