DATA 624: Homework 1

1. Forecast package time series datasets

  1. Use the help function to explore what the series gold, woolyrnq and gas represent.
  1. Use autoplot() to plot each of these in separate plots.
  2. What is the frequency of each series? Hint: apply the frequency() function.
  3. Use which.max() to spot the outlier in the gold series. Which observation was it?
#help(gold)
#help(woolyrnq)
#help(gas)

Description of the time series:

  • gold: Daily morning gold prices in US dollars. 1 January 1985 - 31 March 1989.
  • woolyrnq: Quarterly production of woollen yarn in Australia: tonnes. Mar 1965 - Sep 1994. Source: Time Series Data Library.
  • gas: Australian monthly gas production: 1956-1995. Source: Australian Bureau of Statistics.

autoplot(gold, series = "Data") + 
  scale_colour_manual(values="blue")

autoplot(woolyrnq, series = "Data") + 
  scale_colour_manual(values="darkgreen")

autoplot(gas, series = "Data") + 
  scale_colour_manual(values="red")

print(paste0("The frequency of the gold time series is: ", frequency(gold)))
## [1] "The frequency of the gold time series is: 1"
print(paste0("The frequency of the woollen yarn time series is: ", frequency(woolyrnq)))
## [1] "The frequency of the woollen yarn time series is: 4"
print(paste0("The frequency of the gas time series is: ", frequency(gas)))
## [1] "The frequency of the gas time series is: 12"
# Find the outlier in the gold series
print(paste0("The outlier of the gold time series is: ", which.max(gold)))
## [1] "The outlier of the gold time series is: 770"
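
Because which.max() returns the position of the largest value rather than the value or its date, it can help to look up all three. The following is a minimal sketch on a hypothetical series; the series `x` and the injected spike are illustrative assumptions, not the gold data.

```r
# Hypothetical series with one injected spike (a stand-in for a series like gold)
x <- ts(50 + sin(seq_len(100) / 5), start = 1, frequency = 1)
x[42] <- 200                      # artificial outlier at observation 42

idx   <- which.max(x)             # position of the largest value
when  <- time(x)[idx]             # time index of that observation
value <- x[idx]                   # the value itself

cat("Observation:", idx, "| time:", when, "| value:", value, "\n")
```

Applied to the homework data, the same pattern would be gold[which.max(gold)] for the value and time(gold)[which.max(gold)] for its time index.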

2. Quarterly Sales Data Time Series

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

  1. You can read the data into R with the following script:
url <- 'https://raw.githubusercontent.com/nealxun/ForecastingPrinciplePractices/master/extrafiles/tute1.csv'
tute1 <- read.csv(url, header=TRUE)
View(tute1)
  1. Convert the data to a time series:
mytimeseries <- ts(tute1[,-1], start=1981, frequency=4)

(The [,-1] removes the first column which contains the quarters as we don’t need them now.)
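
To see how ts() rebuilds the time index from start and frequency once the label column is dropped, here is a small sketch. The data frame `df` and all of its values are made up for illustration; only the column layout is assumed to resemble tute1.csv.

```r
# Hypothetical quarterly data shaped like tute1.csv (values are made up)
df <- data.frame(
  Quarter  = c("Mar-81", "Jun-81", "Sep-81", "Dec-81", "Mar-82", "Jun-82"),
  Sales    = c(1020, 889, 795, 1003, 1058, 944),
  AdBudget = c(659, 589, 512, 614, 647, 602),
  GDP      = c(251, 290, 290, 292, 279, 254)
)

# Drop the label column; ts() derives the time index from start and frequency
qts <- ts(df[, -1], start = c(1981, 1), frequency = 4)

print(frequency(qts))                   # observations per year
print(start(qts))                       # first period as c(year, quarter)
print(end(qts))                         # last period as c(year, quarter)
print(window(qts, start = c(1982, 1)))  # subset by time rather than row number
```

Once the index lives in the ts object, window() can select periods by date, which is why the text column of quarter labels is no longer needed.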

  1. Construct time series plots of each of the three series
autoplot(mytimeseries, facets=TRUE)

Check what happens when you don’t include facets=TRUE.

autoplot(mytimeseries, facets=FALSE)

With the facets argument set to FALSE, the three time series are drawn on the same axes, each in its own colour.

3. Australian Retail Data

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

retaildata <- readxl::read_excel('retail.xlsx', skip=1)
colnames(retaildata[93])
## [1] "A3349824F"
myts <- ts(retaildata[,"A3349824F"], frequency=12, start=c(1982,4))

Findings:

  • Non-stationarity (change in distribution function over time - increasing mean and standard deviation)
  • Multiplicative error (increasing volatility as values increase)
  • Exponential growth (better modelled as log-linear)
  • Monthly seasonality, with successive rise in 4th quarter peaking in December, and calendar-adjusted trough in April
  • Strong positive autocorrelation (due to persistent upward trend)
autoplot(myts, series="Data") + 
  scale_colour_manual(values="blue")

# Log-transformed series (roughly linear if growth is exponential)
autoplot(log(myts))
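
The reason a log transform suits exponential growth can be checked numerically: on the log scale, a constant percentage growth rate becomes a constant slope. This sketch uses a hypothetical noise-free series `y` (not the retail data) with an assumed growth rate of 1% per month.

```r
# Hypothetical monthly series growing 1% per month (not the retail data)
t <- 1:120
y <- ts(100 * exp(0.01 * t), start = c(1982, 4), frequency = 12)

# On the log scale, exponential growth becomes a straight line in time
fit    <- lm(as.numeric(log(y)) ~ t)
growth <- unname(coef(fit)["t"])  # slope = growth rate per month on the log scale

cat("Estimated monthly growth rate:", round(growth, 4), "\n")
```

With real data the fit would not be exact, but a roughly straight log(myts) plot points to the same interpretation.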

#Calendar adjusting the series according to days per month
myts_caladj <- myts/monthdays(myts)
ggseasonplot(myts_caladj)

ggsubseriesplot(myts_caladj)
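
The calendar adjustment above divides each monthly total by the number of days in that month, giving a per-day average. As a sketch of what that division involves, the hypothetical helper `days_per_month` below recomputes the day counts in base R; it is an illustration only and is not the forecast package's implementation of monthdays().

```r
# Days in each month of a monthly ts, using base R only
# (days_per_month is a hypothetical helper, not the forecast implementation)
days_per_month <- function(x) {
  stopifnot(frequency(x) == 12)
  year  <- as.integer(floor(as.numeric(time(x)) + 1e-6))
  month <- as.integer(cycle(x))
  days  <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)[month]
  leap  <- (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
  days[month == 2 & leap] <- 29
  ts(days, start = start(x), frequency = 12)
}

# Hypothetical flat monthly totals for 1983-1984 (1984 is a leap year)
x   <- ts(rep(3000, 24), start = c(1983, 1), frequency = 12)
d   <- days_per_month(x)
adj <- x / d                      # average per day within each month

print(round(head(adj, 3), 2))
```

Even though the monthly totals are identical, the per-day values differ between months, which is exactly the calendar effect the adjustment removes.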

gglagplot(myts)

ggAcf(myts)

The annual seasonality is apparent from the ACF of the first-differenced series.

ggAcf(diff(myts))
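
Why differencing exposes the seasonality can be illustrated with base R's acf() on a hypothetical series: a trend keeps every low-order autocorrelation high, and removing it leaves the 12-month pattern. The series `x` below is synthetic (linear trend plus a sine wave) and is an assumption made for illustration.

```r
# Hypothetical monthly series: linear trend plus a 12-month sine wave
t <- 1:120
x <- ts(0.5 * t + 10 * sin(2 * pi * t / 12), frequency = 12)

acf_raw  <- acf(x, lag.max = 24, plot = FALSE)$acf[, 1, 1]
acf_diff <- acf(diff(x), lag.max = 24, plot = FALSE)$acf[, 1, 1]

# Element 1 is lag 0, so lag k sits at position k + 1
cat("Raw series,  lag 1 :", round(acf_raw[2], 2), "\n")
cat("Differenced, lag 6 :", round(acf_diff[7], 2), "\n")
cat("Differenced, lag 12:", round(acf_diff[13], 2), "\n")
```

In the differenced series the autocorrelation is strongly negative half a year apart and strongly positive a full year apart, the signature of an annual pattern.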

6. Other Time Series Visualization

Use the following graphics functions: autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf() and explore features of the following time series: hsales, usdeaths, bricksq, sunspotarea, gasoline.

(A) One-Family House Sales (hsales)

Description: Monthly sales of new one-family houses sold in the USA since 1973. Source: Makridakis, Wheelwright and Hyndman (1998), from the fma package.

Findings:

  • Stationarity (no apparent long-term trend; constant mean)
  • 8-year cycles (peaks in 1978, 1986, and 1994; troughs in 1975, 1982-83, and 1991)
  • Monthly seasonality, with calendar-adjusted peak in April, and progressive decline from August through to December
  • Positive, wave-like autocorrelation for the first 15 lags (due to seasonality)

autoplot(hsales)

# Standardized series (centered on the mean, scaled by the standard deviation)
autoplot((hsales - mean(hsales))/sd(hsales))

ggseasonplot(hsales)

hsales_caladj <- hsales/monthdays(hsales) #Calendar adjustment
ggsubseriesplot(hsales_caladj)

gglagplot(hsales)

ggAcf(hsales)

ggAcf(diff(hsales))

(B) Accidental Deaths (usdeaths)

Description: Monthly accidental deaths in the USA. Source: Makridakis, Wheelwright and Hyndman (1998), from the fma package.

For life insurance purposes, the following are classified as accidental deaths:

  • Poisoning
  • Motor vehicle accidents
  • Falls
  • Murder (or homicide)
  • Suffocation
  • Drowning
  • Fire or burns
  • Pedestrian deaths

Source: https://www.glgamerica.com/what-is-accidental-death/

Findings:

  • Stationarity (no apparent long-term trend; constant mean)
  • Monthly seasonality, with steep calendar-adjusted peak in July, and flat calendar-adjusted trough from January to March
  • Positive, wave-like autocorrelation (due to seasonality)
  • No apparent cycle

usdeaths_caladj <- usdeaths/monthdays(usdeaths)
autoplot(usdeaths)

ggseasonplot(usdeaths_caladj)

ggsubseriesplot(usdeaths_caladj)

gglagplot(usdeaths)

ggAcf(usdeaths)

(C) Quarterly Clay Brick Production (bricksq)

Description: Australian quarterly clay brick production: 1956-1994. Source: Makridakis, Wheelwright and Hyndman (1998), from the fma package.

Findings:

  • Linear growth trend (especially from 1956-1980)
  • Non-stationarity (change in distribution function over time - increasing mean, especially from 1956-1980)
  • Multiplicative error (increasing volatility as values increase)
  • Quarterly seasonality, with lull in Quarter 1 and slightly higher peak in Quarter 3
  • Strongly positive autocorrelation, decreasing overall, except for being scalloped upward at seasonal lag 4

autoplot(bricksq)

ggseasonplot(bricksq, polar = TRUE)

ggsubseriesplot(bricksq)

gglagplot(bricksq)

ggAcf(bricksq)

ggAcf(diff(bricksq)) #First-differencing to show quarterly seasonality

(D) Sunspot Area (sunspotarea)

Description: Annual average sunspot area (1875-2015). Annual averages of the daily sunspot areas (in units of millionths of a hemisphere) for the full sun. Sunspots are magnetic regions that appear as dark spots on the surface of the sun. The Royal Greenwich Observatory compiled daily sunspot observations from May 1874 to 1976. Later data are from the US Air Force and the US National Oceanic and Atmospheric Administration. Source: NASA. From the fpp2 package.

Findings:

  • Stationarity (no apparent change in distribution function over time - constant mean and standard deviation)
  • 11-year cycle (strongly supported by external research). The roughly 11-year sunspot cycle is driven by the sun’s magnetic activity: according to NASA, the sun’s magnetic field reverses polarity about once per cycle. Source: https://www.space.com/new-sunspot-solar-cycle-begins.html
  • Seasonality is not applicable to annual data (so ggseasonplot(sunspotarea) and ggsubseriesplot(sunspotarea) are not used)
  • Wave-like autocorrelation, with peaks near lags 10-11 and 21-22 (due to the 11-year cycle rather than a trend)

autoplot(sunspotarea)

gglagplot(sunspotarea, lags=20) # Points lie closest to the diagonal at lags 10 and 11

ggAcf(sunspotarea, lag.max=25) # Autocorrelation peaks at lags 1, 10-11, and 21-22
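
The cycle length can also be read off the ACF programmatically by finding the lag with the highest autocorrelation away from lag 0. This sketch uses `sunspot.year`, the yearly sunspot-number series that ships with base R. It is a different series from sunspotarea and serves only as a stand-in, and the 6-16 search window is an assumption chosen to bracket the expected cycle.

```r
# sunspot.year ships with base R (yearly sunspot numbers, 1700-1988);
# it is a different series from sunspotarea and is used here only as a stand-in
a <- acf(sunspot.year, lag.max = 25, plot = FALSE)$acf[, 1, 1]

lags     <- 6:16                           # assumed search window around the cycle
peak_lag <- lags[which.max(a[lags + 1])]   # +1 because position 1 is lag 0

cat("ACF peaks at lag", peak_lag, "within lags 6-16\n")
```

The same two lines applied to sunspotarea would give a comparable estimate for that series.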

(E) Motor Gasoline Supply (gasoline)

Description: US finished motor gasoline product supplied. Weekly data beginning 2 February 1991, ending 20 January 2017. Units are “million barrels per day”. Source: US Energy Information Administration. From the fpp2 package.

Findings:

  • Linear growth trend from 1991 to 2007, with an apparent change in the trend line from 2007. The upward trend continues for 16 years until 2007, then resumes in 2014, albeit with great volatility.
  • Non-stationarity (change in distribution function over time - increasing mean)
  • Additive error (no apparent increase in volatility as values increase)
  • Annual seasonality in the weekly data, with lulls in supply in weeks 5-11 (late winter) and spikes in weeks 28-38 (late summer)
  • The slow decrease in the ACF as the lags increase is due to the trend and lack of stationarity, while the “scalloped” shape is due to the seasonality.

autoplot(gasoline)

ggseasonplot(gasoline, polar=TRUE)

#ggsubseriesplot(gasoline)
#gglagplot(gasoline)
ggAcf(gasoline, lag.max=156) # roughly the first 3 years of weekly lags

ggAcf(diff(gasoline), lag.max=156) # roughly the first 3 years of weekly lags
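
The additive-versus-multiplicative judgments in the findings above were made by eye. One rough numeric check is to compare the within-year spread at the end of a series with the spread at its start. The sketch below does this on two hypothetical series; `spread_ratio` and both series are illustrative assumptions, not part of the forecast package or the homework data.

```r
# Two hypothetical monthly series built from the same trend and seasonal shape
t      <- 1:120
trend  <- 100 + 2 * t
season <- sin(2 * pi * t / 12)
additive       <- ts(trend + 20 * season, start = c(2000, 1), frequency = 12)
multiplicative <- ts(trend * (1 + 0.2 * season), start = c(2000, 1), frequency = 12)

# Ratio of the last year's standard deviation to the first year's
# (spread_ratio is a hypothetical helper for illustration)
spread_ratio <- function(x) {
  yearly_sd <- as.numeric(aggregate(x, nfrequency = 1, FUN = sd))
  yearly_sd[length(yearly_sd)] / yearly_sd[1]
}

cat("Additive series      :", round(spread_ratio(additive), 2), "\n")
cat("Multiplicative series:", round(spread_ratio(multiplicative), 2), "\n")
```

A ratio near 1 is consistent with additive variation, while a ratio well above 1 on a rising series suggests the spread scales with the level. On noisy real data this is only a heuristic.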