Chapters 1 and 2, Hyndman and Athanasopoulos

  1. Use the help function to explore what the series gafa_stock, PBS, vic_elec and pelt represent.

The help function (?) displays info about each series in the Help window. The descriptions for each series are copied into the comments below.

?gafa_stock  # Historical stock prices from 2014-2018 for Google, Amazon, Facebook and Apple.
             # All prices are in $USD
             # Variables include: Open, High, Low, Close, Adj_Close, Volume over 
             # one key (Symbol)

?PBS         # Monthly Medicare Australia prescription data. 
             # Variables include: Scripts and Cost over 
             # four keys (Concession, Type, ATC1, ATC2)

?vic_elec    # Half-hourly electricity demand for Victoria, Australia. 
             # Variables include: Demand, Temperature, Date, Holiday over 
             # the index (Time); this series has no key

?pelt        # Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx
             # furs from 1845 to 1935.
             # This data contains trade records for all areas of the company. 
             # Variables include: Hare and Lynx trades over 
             # the index (Year); this series has no key
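
These key/index descriptions can also be checked programmatically. A quick sketch using tsibble's accessor functions (index_var() and key_vars() are exported by tsibble, which fpp3 loads):

library(fpp3)

index_var(gafa_stock)  # should return "Date"
key_vars(gafa_stock)   # should return "Symbol"
index_var(vic_elec)    # should return "Time"; key_vars(vic_elec) is empty
key_vars(pelt)         # character(0): Year is the index, not a key
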
  a. Use autoplot() to plot some of the series in these data sets.

I interpret this task as selecting one or more series from each data set and plotting a time series.

The first plot is Amazon’s stock volumes over time (day).

AMZN <- gafa_stock %>%
  filter(Symbol == "AMZN") %>%  # keep rows for AMZN only
  mutate(Vol = Volume/1e6) %>%  # divide Volume by 1M for y-axis scaling
  select(Date,Vol) %>%          # select only the Date and Vol columns
  as_tsibble(index = Date)      # set Date as the index (Date is already a date type, so no conversion is needed)

autoplot(AMZN,Vol) +
  labs(y = "Volume (M)",
       title = "Amazon (AMZN) stock volume")

The second plot is the total cost of ATC2 group D01 over time (month).

D01 <- PBS %>%
  filter(ATC2 == "D01") %>%               # keep ATC2 group "D01"
  select(Month,Concession,Type,Cost) %>%  # keep the index, the keys and Cost
  summarize(TotalC = sum(Cost)) %>%       # total Cost across the keys for each Month
  mutate(Cost = TotalC/1e5)               # divide Cost by 100K for y-axis scaling

autoplot(D01,Cost) +
  labs(y = "Total Cost ($100,000s)",
       title = "Total cost of ATC2 group D01")

The third plot is electricity demand in Victoria, Australia over time (half-hours).

e <- vic_elec %>%
  filter(year(Time) == 2014)  # filter for only 2014 data

autoplot(e,Demand) +
  labs(y = "Demand (MWh)", title = "Half-hourly electricity demand: Victoria")

The fourth plot looks at volumes of Hare pelts over time (year).

Hares <- pelt %>%
  select(Year,Hare)  # keep only the index (Year) and the Hare counts

autoplot(Hares,Hare) +
  labs(y = "Hare pelts", title = "Annual Hare pelt volumes")

  b. What is the time interval of each series?

The time interval for stock prices is daily (trading days only, so gafa_stock has an irregular interval), for pharmaceuticals monthly, for electricity half-hourly, and for pelt volumes annual.
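
These intervals can be confirmed with tsibble's interval() helper, which reports the implied interval of each series (the bracketed code shown in a tsibble's header):

interval(gafa_stock)  # irregular (!): trading days are not evenly spaced
interval(PBS)         # 1M  (monthly)
interval(vic_elec)    # 30m (half-hourly)
interval(pelt)        # 1Y  (annual)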

  2. Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
peak_close_days <- gafa_stock %>%
  group_by(Symbol) %>%             # group by Symbol so the result contains one row per stock
  filter(Close == max(Close)) %>%  # keep the row with the peak closing price for each stock
  select(Symbol,Close)             # select only the stock symbol and closing price
                                   # note that Date will also display because it is the index of the tsibble

peak_close_days
## # A tsibble: 4 x 3 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Close Date      
##   <chr>  <dbl> <date>    
## 1 AAPL    232. 2018-10-03
## 2 AMZN   2040. 2018-09-04
## 3 FB      218. 2018-07-25
## 4 GOOG   1268. 2018-07-26
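
An equivalent that some may find more direct, assuming dplyr >= 1.0.0 for slice_max(): drop to a tibble first, so the grouped slice is not constrained by the index.

gafa_stock %>%
  as_tibble() %>%              # drop the tsibble structure; we only want one row per group
  group_by(Symbol) %>%
  slice_max(Close, n = 1) %>%  # the row with the maximum closing price per stock
  select(Symbol, Date, Close)
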
  3. Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
  a. You can read the data into R with the following script:
tute1 <- readr::read_csv("tute1.csv") # the data is imported as a tibble
## Rows: 100 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Quarter
## dbl (3): Sales, AdBudget, GDP
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tute1)
## # A tibble: 6 x 4
##   Quarter   Sales AdBudget   GDP
##   <chr>     <dbl>    <dbl> <dbl>
## 1 3/1/1981  1020.     659.  252.
## 2 6/1/1981   889.     589   291.
## 3 9/1/1981   795      512.  291.
## 4 12/1/1981 1004.     614.  292.
## 5 3/1/1982  1058.     647.  279.
## 6 6/1/1982   944.     602   254
  b. Convert the data to time series
mytimeseries <- tute1 %>%
  mutate(Quarter = yearquarter(Quarter)) %>%  # convert the Quarter <chr> to a quarterly index <S3: yearquarter>
  as_tsibble(index = Quarter)                 # set the Quarter field as the index to the tsibble

head(mytimeseries)
## # A tsibble: 6 x 4 [1Q]
##   Quarter Sales AdBudget   GDP
##     <qtr> <dbl>    <dbl> <dbl>
## 1 1981 Q1 1020.     659.  252.
## 2 1981 Q2  889.     589   291.
## 3 1981 Q3  795      512.  291.
## 4 1981 Q4 1004.     614.  292.
## 5 1982 Q1 1058.     647.  279.
## 6 1982 Q2  944.     602   254
  c. Construct time series plots of each of the three series
mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid()

The series are plotted on a single panel with a single shared y-axis scale, which makes it harder to compare series measured on different scales.

mytimeseries %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

  4. The USgas package contains data on the demand for natural gas in the US.
  a. Install the USgas package.
if (!requireNamespace("USgas", quietly = TRUE)) install.packages("USgas")  # install USgas only if it is missing
library(USgas)
  b. Create a tsibble from us_total with year as the index and state as the key.
us_total <- us_total %>%
  as_tsibble(key = state,
             index = year)

head(us_total)
## # A tsibble: 6 x 3 [1Y]
## # Key:       state [1]
##    year state        y
##   <int> <chr>    <int>
## 1  1997 Alabama 324158
## 2  1998 Alabama 329134
## 3  1999 Alabama 337270
## 4  2000 Alabama 353614
## 5  2001 Alabama 332693
## 6  2002 Alabama 379343
  c. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
NEgas <- us_total %>%
  filter(state %in% c('Maine','Vermont','New Hampshire','Massachusetts','Connecticut','Rhode Island')) %>% # keep only the New England states (note: %in%, not ==, so no rows are silently dropped by recycling)
  mutate(y = y/1000)  # scale the yearly totals from millions to billions for the y-axis
  
autoplot(NEgas,y) + labs(title = "Annual natural gas consumption by state",
                         subtitle = "New England area",
                         y = "Consumption (billions of cubic feet)")

Although it is clear in the above plot that Massachusetts uses far more natural gas than Vermont, it is not clear whether there is any variation in usage within the lower-consumption states.

It turns out that it is possible to apply a facet grid to autoplot(), which enables free y-axis scaling. This reveals that there is indeed variation in Vermont's usage, as seen below, as well as in Rhode Island's.

autoplot(NEgas,y) + 
  labs(title = "Annual natural gas consumption by state",
       subtitle = "New England area",
       y = "Consumption (billions of cubic feet)") +
  facet_grid(state ~ ., scales = "free_y")

  5. Download tourism.xlsx from the book website and read it into R using readxl::read_excel()
tourism <- readxl::read_excel("tourism.xlsx")  # the data is imported as a tibble
head(tourism)
## # A tibble: 6 x 5
##   Quarter    Region   State           Purpose  Trips
##   <chr>      <chr>    <chr>           <chr>    <dbl>
## 1 1998-01-01 Adelaide South Australia Business  135.
## 2 1998-04-01 Adelaide South Australia Business  110.
## 3 1998-07-01 Adelaide South Australia Business  166.
## 4 1998-10-01 Adelaide South Australia Business  127.
## 5 1999-01-01 Adelaide South Australia Business  137.
## 6 1999-04-01 Adelaide South Australia Business  200.
  a. Create a tsibble which is identical to the tourism tsibble from the tsibble package
tourism_ts <- tourism %>%
  mutate(Quarter = yearquarter(Quarter)) %>%      # convert the Quarter <chr> to a quarterly index <S3: yearquarter>
  as_tsibble(key = c("Region","State","Purpose"), # Region, State and Purpose together identify each series
             index = Quarter)                     # set the Quarter field as the index to the tsibble

head(tourism_ts)
## # A tsibble: 6 x 5 [1Q]
## # Key:       Region, State, Purpose [1]
##   Quarter Region   State           Purpose  Trips
##     <qtr> <chr>    <chr>           <chr>    <dbl>
## 1 1998 Q1 Adelaide South Australia Business  135.
## 2 1998 Q2 Adelaide South Australia Business  110.
## 3 1998 Q3 Adelaide South Australia Business  166.
## 4 1998 Q4 Adelaide South Australia Business  127.
## 5 1999 Q1 Adelaide South Australia Business  137.
## 6 1999 Q2 Adelaide South Australia Business  200.
  b. Find what combination of Region and Purpose had the maximum number of overnight trips on average.

It appears that Sydney, combined with the purpose Visiting, had the highest average number of overnight trips.

max_trips <- tourism_ts %>%                  
  group_by(Region,Purpose) %>%        # group by Region and Purpose
  mutate(Trips = mean(Trips)) %>%     # compute the average number of trips per group
  distinct(Region,Purpose,Trips) %>%  # keep one distinct row per group (this drops to a tibble)
  ungroup() %>%                       # remove the grouping
  filter(Trips == max(Trips))         # keep the Region/Purpose pair with the highest average
        
max_trips
## # A tibble: 1 x 3
##   Region Purpose  Trips
##   <chr>  <chr>    <dbl>
## 1 Sydney Visiting  747.
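
A sketch of an alternative, assuming dplyr >= 1.0.0 for slice_max(): since the averages ignore the time index anyway, the tsibble can be dropped to a tibble and summarised directly.

tourism_ts %>%
  as_tibble() %>%                                      # drop the time index; only group means are needed
  group_by(Region,Purpose) %>%
  summarise(Trips = mean(Trips), .groups = "drop") %>% # average trips per Region/Purpose pair
  slice_max(Trips, n = 1)                              # keep the pair with the highest average
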
  c. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
tourism_totals <- tourism %>%
  select(-Region,-Purpose) %>%    # remove the Region and Purpose columns
  group_by(State) %>%             # group by State
  mutate(Trips = sum(Trips)) %>%  # sum the Trips by State across all quarters
  distinct(State,Trips) %>%       # keep one row of totals per State
  arrange(State)                  # arrange the States alphabetically

tourism_totals
## # A tibble: 8 x 2
## # Groups:   State [8]
##   State                Trips
##   <chr>                <dbl>
## 1 ACT                 41007.
## 2 New South Wales    557367.
## 3 Northern Territory  28614.
## 4 Queensland         386643.
## 5 South Australia    118151.
## 6 Tasmania            54137.
## 7 Victoria           390463.
## 8 Western Australia  147820.
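
Strictly speaking, the exercise asks for a tsibble, and the grand totals above collapse the time index into a plain tibble. A sketch that keeps Quarter as the index, relying on tsibble's index-aware summarise() (which aggregates over the grouping variable within each Quarter; state_trips is a name I have chosen here):

state_trips <- tourism_ts %>%
  group_by(State) %>%            # State becomes the key of the result
  summarise(Trips = sum(Trips))  # total trips by State for each Quarter

The result is a tsibble with key State and index Quarter, which can be passed straight to autoplot().
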
  6. Monthly Australian retail data is provided in aus_retail. Select one of the time series as follows (but choose your own seed value):
set.seed(54321)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

head(myseries)
## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State           Industry           `Series ID`    Month Turnover
##   <chr>           <chr>              <chr>          <mth>    <dbl>
## 1 South Australia Clothing retailing A3349360V   1982 Apr     19.1
## 2 South Australia Clothing retailing A3349360V   1982 May     21.6
## 3 South Australia Clothing retailing A3349360V   1982 Jun     18.3
## 4 South Australia Clothing retailing A3349360V   1982 Jul     18.6
## 5 South Australia Clothing retailing A3349360V   1982 Aug     17.1
## 6 South Australia Clothing retailing A3349360V   1982 Sep     18.2

Explore your chosen retail time series using the following functions:

autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() %>% autoplot()

Autoplot the Turnover by month.

autoplot(myseries,Turnover) +
  labs(title = "Turnover by Month")

Do a gg_season plot of the Turnover by month.

myseries %>% gg_season(Turnover, period = "year") +
  theme(legend.position = "right") +
  labs(y = "Turnover", title="Turnover by Month")

Do a gg_subseries plot of the Turnover by month.

myseries %>% gg_subseries(Turnover) +
  labs(y = "Turnover", title="Turnover by Month")

Do a gg_lag plot of the Turnover by month.

myseries %>% gg_lag(Turnover, geom = "point") +
  labs(x = "lag(Turnover,k)", title="Turnover by Month")

Do an Autocorrelation plot of the Turnover by month.

myseries %>%
  ACF(Turnover) %>%
    autoplot() + labs(title = "Turnover by Month")

Can you spot any seasonality, cyclicity and trend?

The autoplot hints at some seasonality in the “heartbeat” pattern within each year, and there also appears to have been an upward trend between about 2000 and 2010. The gg_season plot confirms the seasonality, with a spike each December followed by a low point in the following February, perhaps influenced by the Christmas/Boxing Day holidays. Looking at the year-over-year pattern for each month in the gg_subseries plot, we can also see the upward trend from 2000 to 2010, which was both preceded and followed by what may be cyclic behaviour. The gg_lag plot shows a strong positive relationship in turnover at every lag, which confirms the annual seasonality in the data. Lastly, the autocorrelation plot displays both seasonality and trend: the scalloped shape recurs from year to year, and the autocorrelations decrease only slowly as the lag increases.
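
The scalloping is easier to see with more lags. As a refinement beyond the exercise (lag_max is an argument of feasts' ACF()), extending the plot to four years of monthly lags shows both the recurring seasonal peaks and the slow trend-driven decay:

myseries %>%
  ACF(Turnover, lag_max = 48) %>%  # autocorrelations out to 48 monthly lags
    autoplot() + labs(title = "Turnover autocorrelations, lags 1-48")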

What do you learn about the series?

Retail clothing turnover in South Australia is strongly seasonal, peaking each December, and also shows a long-run upward trend with what may be cyclical rises and falls roughly every ten years. Further inquiry is warranted to learn more about the drivers of these patterns.