Data 624 Assignment 1

Load packages

library(fpp3)

Instructions

Please submit exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book. Please submit both your Rpubs link as well as attach the .pdf file with your code.

Exercise 1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

Use ? (or help()) to find out about the data in each series.

? aus_production
? pelt
? gafa_stock
? vic_elec

Bricks from aus_production is a time series of quarterly clay brick production in Australia from 1956 to 2010. Quantities are expressed in millions of bricks. The help page for this tsibble describes it as half-hourly, but that looks to be incorrect: the records in the table are quarterly, which matches the title of the help page.

Lynx from pelt is a time series of the number of Canadian Lynx pelts traded by the Hudson Bay Company from 1845 to 1935.

Close from gafa_stock is a time series of closing stock prices from 2014 to 2018 for Google, Amazon, Facebook, and Apple. Prices are expressed in USD.

Demand from vic_elec is a time series of total electricity demand in Victoria, Australia from 2012 to 2014. It is a half-hourly tsibble and Demand is expressed in MWh.

What is the time interval of each series?

print(aus_production, n = 218)
print(min(vic_elec$Date))
print(max(vic_elec$Date))

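aus_production is quarterly (1956 to 2010), pelt is annual (1845 to 1935), gafa_stock is daily but only on trading days, so the tsibble is irregular (2014 to 2018), and vic_elec is half-hourly (2012 to 2014).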
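The interval can also be read programmatically instead of by scanning the printed table. The lines below are a minimal sketch, assuming fpp3 is attached as at the top of this document; interval() and is_regular() come from the tsibble package, and the bracketed label in a tsibble header (for example [1Q]) reports the same information.

```r
library(fpp3)

# interval() reports the spacing between observations of a tsibble
interval(aus_production)
interval(pelt)
interval(gafa_stock)
interval(vic_elec)

# is_regular() is FALSE when observations are not equally spaced,
# as with stock prices recorded on trading days only
is_regular(gafa_stock)

# First and last time stamp of the index column
range(vic_elec$Time)
```

This separates the two questions that are easy to mix up: the interval is the spacing between observations, while the range is the period the series covers.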

Use autoplot() to produce a time plot of each series

autoplot(aus_production,Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

autoplot(pelt,Lynx)

autoplot(gafa_stock,Close)

autoplot(vic_elec,Demand)

For the last plot, modify the axis labels and title.

autoplot(vic_elec, Demand) + 
  labs(title = "Time Plot of Electricity Demand in Victoria, Australia from 2012 to 2014", 
       y = "Demand in MWh", x = "Time")

Exercise 2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

mod_gafa_stock <- gafa_stock %>% group_by(Symbol) %>% filter(Close == max(Close)) 
mod_gafa_stock

## # A tsibble: 4 x 8 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date        Open  High   Low Close Adj_Close   Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 AAPL   2018-10-03  230.  233.  230.  232.      230. 28654800
## 2 AMZN   2018-09-04 2026. 2050. 2013  2040.     2040.  5721100
## 3 FB     2018-07-25  216.  219.  214.  218.      218. 58954200
## 4 GOOG   2018-07-26 1251  1270. 1249. 1268.     1268.  2405600

Exercise 3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

You can read the data into R with the following script:

tute1 <- readr::read_csv("tute1.csv")
View(tute1)

Convert the data to time series

mytimeseries <- tute1 |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter)

mytimeseries

## # A tsibble: 100 x 4 [1Q]
##    Quarter Sales AdBudget   GDP
##      <qtr> <dbl>    <dbl> <dbl>
##  1 1981 Q1 1020.     659.  252.
##  2 1981 Q2  889.     589   291.
##  3 1981 Q3  795      512.  291.
##  4 1981 Q4 1004.     614.  292.
##  5 1982 Q1 1058.     647.  279.
##  6 1982 Q2  944.     602   254 
##  7 1982 Q3  778.     531.  296.
##  8 1982 Q4  932.     608.  272.
##  9 1983 Q1  996.     638.  260.
## 10 1983 Q2  908.     582.  280.
## # ℹ 90 more rows

Construct time series plot of each of the three series

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

Check what happens when you don’t include facet_grid()

mytimeseries |>
  pivot_longer(-Quarter) |>
  ggplot(aes(x = Quarter, y = value, colour = name)) +
  geom_line()

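Without facet_grid(), all three series are drawn in a single panel on a shared y-axis. Because Sales, AdBudget and GDP sit at different levels, the shared scale compresses each line and makes the variation within a series harder to see than in the separate free-scale panels.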

Exercise 4

The USgas package contains data on the demand for natural gas in the US.

Install the USgas package.

library(USgas)

Create a tsibble from us_total with year as the index and state as the key.

USgasTsibble <-  us_total %>%
  as_tsibble(index = year, key = state)

USgasTsibble

## # A tsibble: 1,266 x 3 [1Y]
## # Key:       state [53]
##     year state        y
##    <int> <chr>    <int>
##  1  1997 Alabama 324158
##  2  1998 Alabama 329134
##  3  1999 Alabama 337270
##  4  2000 Alabama 353614
##  5  2001 Alabama 332693
##  6  2002 Alabama 379343
##  7  2003 Alabama 350345
##  8  2004 Alabama 382367
##  9  2005 Alabama 353156
## 10  2006 Alabama 391093
## # ℹ 1,256 more rows

Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

NewEngland <- c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Rhode Island')
USgasTsibble <- USgasTsibble %>% filter(state %in% NewEngland)
autoplot(USgasTsibble, y)

Exercise 5

Download tourism.xlsx from the book website

Read it into R using readxl::read_excel()

tourismCopy <- readxl::read_excel("tourism.xlsx")
view(tourismCopy)

Create a tsibble which is identical to the tourism tsibble from the tsibble package.

tourism

## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows

tourismCopy <- tourismCopy %>%
  mutate(Quarter = yearquarter(Quarter)) %>%
  as_tsibble(index = Quarter, key = c(Region, State, Purpose))
tourismCopy

## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows

Find what combination of Region and Purpose had the maximum number of overnight trips on average.

modTourismCopy <- tourismCopy %>%
  as_tibble() %>%   # plain tibble, so summarise() does not keep Quarter as an implicit group
  group_by(Region, Purpose) %>%
  summarise(averageTrips = mean(Trips), .groups = "drop") %>%
  filter(averageTrips == max(averageTrips))
modTourismCopy

Averaging within each Region and Purpose first and then filtering the ungrouped result leaves a single row: the combination with the highest mean number of trips. Filtering while the data is still grouped compares each group's average only with itself, so it keeps every row rather than the maximum.

Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

tourismCopy2 <- readxl::read_excel("tourism.xlsx")
tourismCopy2 <-  tourismCopy2 %>%
  mutate(Quarter = yearquarter(Quarter)) %>%  
  group_by(State, Quarter) %>%
  summarise(Trips = sum(Trips)) %>%
  distinct() %>%
  as_tsibble(index = Quarter, key = State)
tourismCopy2

## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
## # Groups:    State [8]
##    State Quarter Trips
##    <chr>   <qtr> <dbl>
##  1 ACT   1998 Q1  551.
##  2 ACT   1998 Q2  416.
##  3 ACT   1998 Q3  436.
##  4 ACT   1998 Q4  450.
##  5 ACT   1999 Q1  379.
##  6 ACT   1999 Q2  558.
##  7 ACT   1999 Q3  449.
##  8 ACT   1999 Q4  595.
##  9 ACT   2000 Q1  600.
## 10 ACT   2000 Q2  557.
## # ℹ 630 more rows

Exercise 8

Use the following graphics functions: autoplot(), gg_season(), gg_subseries(), gg_lag(), ACF() and explore features from the following time series: “Total Private” Employed from us_employment, Bricks from aus_production, Hare from pelt, “H02” Cost from PBS, and Barrels from us_gasoline.

Can you spot any seasonality, cyclicity and trend? What do you learn about the series? What can you say about the seasonal patterns? Can you identify any unusual years?

us_employment %>% filter(Title == "Total Private") %>% gg_season(Employed)

us_employment %>% filter(Title == "Total Private") %>% gg_subseries(Employed)

us_employment %>% filter(Title == "Total Private") %>% gg_lag(Employed, geom = "point")

us_employment %>% filter(Title == "Total Private") %>% ACF(Employed) %>% autoplot()

aus_production  %>% gg_season(Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production  %>% gg_subseries(Bricks)

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

pelt  %>% gg_subseries(Hare)

PBS %>% filter(ATC2 == "H02") %>% gg_season(Cost)

PBS %>% filter(ATC2 == "H02") %>% gg_subseries(Cost)

PBS %>% filter(ATC2 == "H02") %>% ACF(Cost) %>% autoplot()

us_gasoline  %>% gg_season(Barrels)

us_gasoline  %>% gg_lag(Barrels, geom = "point")

us_gasoline  %>% ACF(Barrels) %>% autoplot()

For us_employment, there is potentially a seasonal pattern, with summer months appearing to have slightly higher employment. However, this seasonality is not discernible when using gg_lag(). From the subseries plot we can see that the number employed has been increasing over the years in every month. The ACF plot confirms that the series is trended: the autocorrelations are positive and decrease slowly as the lag increases.

Through the years, there also seem to be cyclical spikes in brick production in Australia, probably corresponding to booms in certain industries (e.g. housing) that require bricks. There look to be large spikes in brick production around 1981 and 1989.

Similarly, the trade in hare pelts appears to be cyclical, rising and falling roughly every 10 years. There are two unusually large spikes, around 1863 and 1885.

PBS is a tsibble of monthly Medicare prescription data in Australia. For H02, there seems to be seasonality in the concessional safety net series, with a discernible increase from February through December, followed by a sharp decline from January to February. The same can be said about the general safety net series. The autocorrelations show that all of the H02 series are seasonal to some degree, with a very clear pattern of larger autocorrelations every 12 months.

The US gasoline time series is both trended and seasonal: the autocorrelations decrease overall as the lag increases, but they also spike periodically relative to the surrounding lags.
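The way trend and seasonality show up together in an ACF can be checked on a series whose structure is known. The sketch below is an illustration only: it uses a synthetic monthly series (not one of the datasets above) built from a linear trend, a 12-month cycle and noise, and it uses base R's acf() rather than feasts::ACF() so that it runs without extra packages. All names and parameter values in it are assumptions chosen for the example.

```r
# Synthetic monthly series (hypothetical data for illustration):
# linear trend + 12-month seasonal cycle + random noise
set.seed(624)
n <- 240
t <- seq_len(n)
y <- 0.5 * t + 20 * sin(2 * pi * t / 12) + rnorm(n, sd = 2)

# Sample autocorrelations at lags 1 to 36 (lag 0 is dropped)
r <- acf(y, lag.max = 36, plot = FALSE)$acf[-1]

# Compare the seasonal lags (12, 24, 36) with the lags halfway between them
round(r[c(6, 12, 18, 24, 30, 36)], 2)
```

The autocorrelations at lags 12, 24 and 36 stand above their neighbours (the seasonal signature), while each seasonal peak is lower than the previous one (the trend signature). This is the same combination described for Barrels above.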