1 Chapter 1 : Fundamentals

1.1 What can we forecast?

Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.

The predictability of an event or a quantity depends on several factors including:

  1. how well we understand the factors that contribute to it;

  2. how much data is available;

  3. whether the forecasts can affect the thing we are trying to forecast.

For example, forecasts of electricity demand can be highly accurate because all three conditions are usually satisfied. We have a good idea of the contributing factors: electricity demand is driven largely by temperatures, with smaller effects for calendar variation such as holidays, and economic conditions.

On the other hand, when forecasting currency exchange rates, only one of the conditions is satisfied: there is plenty of available data. However, we have a limited understanding of the factors that affect exchange rates, and forecasts of the exchange rate have a direct effect on the rates themselves. If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling.

Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

Forecasting situations vary widely in their time horizons, factors determining actual outcomes, types of data patterns, and many other aspects. Forecasting methods can be simple, such as using the most recent observation as a forecast (which is called the naïve method), or highly complex, such as neural nets and econometric systems of simultaneous equations. Sometimes, there will be no data available at all. For example, we may wish to forecast the sales of a new product in its first year, but there are obviously no data to work with. In situations like this, we use judgmental forecasting. The choice of method depends on what data are available and the predictability of the quantity to be forecast.
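As a minimal sketch of the naïve method in R (a hedged example using the forecast package, which is introduced later in this document; the ausbeer series and the horizon h = 4 are arbitrary choices):

# load the forecasting package and an example quarterly series
library(forecast)
library(fpp2)                 # provides the ausbeer series
# naive(): use the most recent observation as the forecast
fc <- naive(ausbeer, h = 4)   # forecast 4 quarters ahead
summary(fc)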

1.2 Forecasting, planning and goals

Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning. However, business forecasting is often done poorly, and is frequently confused with planning and goals. They are three different things.

Forecasting is about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts.

Goals are what you would like to have happen. Goals should be linked to forecasts and plans, but this does not always occur. Too often, goals are set without any plan for how to achieve them, and no forecasts for whether they are realistic.

Planning is a response to forecasts and goals. Planning involves determining the appropriate actions that are required to make your forecasts match your goals.

Short-term forecasts are needed for the scheduling of personnel, production and transportation. As part of the scheduling process, forecasts of demand are often also required

Medium-term forecasts are needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.

Long-term forecasts are used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.

Examples of time series data include:

  • Annual Google profits

  • Quarterly sales results for Amazon

  • Monthly rainfall

  • Weekly retail sales

  • Daily IBM stock prices

  • Hourly electricity demand

1.3 Fundamentals of Time Series

A univariate time series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals. One difference from standard linear regression is that the data are not necessarily independent and not necessarily identically distributed. One defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is very important because there is dependency, and changing the order could change the meaning of the data.

1.3.1 Basic Objectives of the Analysis

The basic objective usually is to determine a model that describes the pattern of the time series. Uses for such a model are:

  1. To describe the important features of the time series pattern.

  2. To explain how the past affects the future or how two time series can “interact”.

  3. To forecast future values of the series.

  4. To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations.

1.3.2 Types of Models

There are two basic types of “time domain” models.

  1. Models that relate the present value of a series to past values and past prediction errors - these are called ARIMA models (for Autoregressive Integrated Moving Average).

  2. Ordinary regression models that use time indices as x-variables. These can be helpful for an initial description of the data and form the basis of several simple forecasting methods.
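As a hedged sketch in R of fitting one model of each type (using auto.arima() and tslm() from the forecast package; the ausbeer series is an arbitrary choice):

library(forecast)
library(fpp2)                             # provides the ausbeer series
# Type 1: an ARIMA model relating present values to past values and errors
fit_arima <- auto.arima(ausbeer)
# Type 2: an ordinary regression using time indices as x-variables
fit_reg <- tslm(ausbeer ~ trend + season)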

1.3.3 Important Characteristics

Some important questions to consider when first looking at a time series are:

  • Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?

  • Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?

  • Are there outliers? In regression, outliers are points far away from your fitted line. With time series data, outliers are observations far away from the rest of the data.

  • Is there a long-run cycle or period unrelated to seasonality factors?

  • Is there constant variance over time, or is the variance non-constant?

  • Are there any abrupt changes to either the level of the series or the variance?

1.3.4 Predictor variables and time series forecasting

Predictor variables are often useful in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form

ED = f(current temperature, strength of economy, population, time of day, day of week, error).

The relationship is not exact — there will always be changes in electricity demand that cannot be accounted for by the predictor variables. The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model. We call this an explanatory model because it helps explain what causes the variation in electricity demand.

Because the electricity demand data form a time series, we could also use a time series model for forecasting. In this case, a suitable time series forecasting equation is of the form

EDt+1 = f(EDt, EDt−1, EDt−2, EDt−3, …, error), where t is the present hour, t+1 is the next hour, t−1 is the previous hour, and so on.

Here, prediction of the future is based on past values of a variable, but not on external variables which may affect the system. Again, the “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model.
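A minimal sketch of this kind of model in base R, using ar(), which fits an autoregressive model predicting a series from its own past values (the lynx series and the 5-step horizon are arbitrary choices):

# Fit an autoregressive model: the present value as a function of past values
fit <- ar(lynx)
# Forecast the next 5 values from the fitted past-value relationship
predict(fit, n.ahead = 5)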

1.3.5 Steps in Forecasting

  1. Problem Definition : This step requires an understanding of the way the forecasts will be used, who requires the forecasts, and how the forecasting function fits within the organisation requiring the forecasts.

  2. Gathering Information : Two kinds of information are required: (a) statistical data, and (b) the accumulated expertise of the people who collect the data and use the forecasts. Often, it will be difficult to obtain enough historical data to be able to fit a good statistical model. In that case, judgmental forecasting methods can be used. Occasionally, old data will be less useful due to structural changes in the system being forecast; then we may choose to use only the most recent data.

  3. Preliminary Analysis (EDA) : Here we start by graphing the data and answering the following questions:

    • Are there consistent patterns?

    • Is there a significant trend?

    • Is seasonality important? Is there evidence of the presence of business cycles?

    • Are there any outliers in the data that need to be explained by those with expert knowledge?

    • How strong are the relationships among the variables available for analysis?

  4. Choosing and Fitting Models : The model to use depends on the availability of historical data, the strength of relationships between the forecast variable and any explanatory variables, and the way in which the forecasts are to be used. Candidate models include regression models, exponential smoothing methods, Box-Jenkins ARIMA models, and several advanced methods such as neural networks and vector autoregression.

  5. Using and Evaluating Forecasting Models : Once a model has been selected and its parameters estimated, the model is used to make forecasts. A number of measures have been developed to help assess the accuracy of forecasts, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
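As a hedged sketch of how some of these measures can be obtained in R with the forecast package (the model and series below are arbitrary choices):

library(forecast)
library(fpp2)                   # provides the ausbeer series
# Fit a model and inspect its information criteria
fit <- auto.arima(ausbeer)
AIC(fit)                        # Akaike's Information Criterion
BIC(fit)                        # Bayesian Information Criterion
# accuracy() reports RMSE, MAE and related error measures
fc <- forecast(fit, h = 8)
accuracy(fc)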

1.3.6 Manipulating Dates : lubridate

If you have to manage dates and times in any analytics role, things can get confusing pretty fast. Time zones, formats, leap years and leap seconds all need to be considered. Depending on which type of task you want to perform with your date and time data, you have a lot of different tools available. We will take a look at the standard tools in base R and then dive deeper into lubridate, which is probably the most convenient way to handle this sort of data.

Main use cases of lubridate related to Time Series include:

  • Date and Time based calculations

  • Advanced format conversions

  • Changing time zone

  • Checking for leap years

# import library
library(lubridate)

Let us parse a few dates using the ymd family of functions, which take a vector as an argument. The ymd class of functions transforms dates stored in character and numeric vectors to Date or POSIXct objects. These functions recognize arbitrary non-digit separators, as well as no separator at all. As long as the order of the components is correct, these functions will parse dates correctly even when the input vectors contain differently formatted dates.

1.3.6.1 Converting to Date Class

# Stored as Year-Month-Day
x = ymd(19931123)
print(x)
## [1] "1993-11-23"
print(class(x))
## [1] "Date"
# Stored as Day-Month-Year
y = dmy(23111993)
print(y)
## [1] "1993-11-23"
print(class(y))
## [1] "Date"
# Stored as Month-Day-Year
z = mdy(11231993)
print(z)
## [1] "1993-11-23"
print(class(z))
## [1] "Date"

2 Chapter 2 : Data Preparation & Visualization

2.1 Creation of Time-Series(ts) object

The data for a time series is stored in an R object called a time-series object. It is an R data object, like a vector or a data frame. The time series object is created by using the ts() function.

While you can have data containing dates and corresponding values in an R object of any other class, such as a data frame, creating objects of the ts class offers many benefits, such as the built-in time index. Also, when you plot a ts object, it automatically creates a plot over time.

Syntax : timeseries.object.name <- ts(data, start, end, frequency)

The function ts() takes three main arguments:

  • data is set to everything in the dataframe / vector except for the date column; it isn’t needed since the ts object will store time information separately.

  • start indicates the start time of the first observation {1st available value of the 1st cycle}

  • frequency is set to indicate periodicity of the observations{# of observations per cycle}

The value of the frequency parameter in the ts() function decides the time intervals at which the data points are measured. A value of 12 indicates a monthly time series. Other values and their meanings are listed below:

  • frequency = 12 pegs the data points for every month of a year.

  • frequency = 4 pegs the data points for every quarter of a year.

  • frequency = 6 pegs the data points for every 10 minutes of an hour.

  • frequency = 24*6 pegs the data points for every 10 minutes of a day.

  • frequency = 52 pegs the data points for every week of a 52-week year.

If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25). Similarly, data that are observed every minute might have an hourly seasonality (frequency=60), a daily seasonality (frequency=24×60=1440), a weekly seasonality (frequency=24×60×7=10080) and an annual seasonality (frequency=24×60×365.25=525960).
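For series with more than one seasonal period, the forecast package provides msts(), which accepts several frequencies at once. A minimal sketch, assuming three years of simulated daily data (the values are placeholders):

library(forecast)
# Simulated daily observations for roughly 3 years
daily <- rnorm(3 * 365)
# A multiple-seasonal time series with weekly and annual periods
y <- msts(daily, seasonal.periods = c(7, 365.25))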

While specifying the start and frequency, care should be taken about the cycle and the number of measurements per cycle.

Here is a list of all the libraries we need for creating, manipulating and analyzing Time Series:

# load libraries
library(tidyverse) # meta package for manipulating and visualizing data frames
library(lubridate) # package for manipulating dates
library(readxl)    # reading Excel files
library(forecast)  # Time Series forecasting & Time Series Linear Models
library(fpp2)      # data for "Forecasting: Principles and Practice"
library(patchwork) # combining multiple ggplot2 plots

2.1.1 Creating a Time-Series from scratch

Let us create a Time-Series data set of 50 random observations drawn uniformly between a min of 10 and a max of 45. Let us assume the data is Quarterly Revenue starting with the year 1956.

# Create a vector of 50 random draws between 10 and 45
mydata = runif(n = 50, min = 10, max = 45)
# Convert Vector to Time-Series "ts" class
# Data starts in 1956 - 4 observations/year (quarterly)
mytimeseries = ts(data = mydata, 
                  start = 1956, frequency = 4)

Let us plot the Time Series

# Plot the Time Series
plot(mytimeseries)

Let us check the class to confirm that it is a ts object

# checking class
class(mytimeseries)
## [1] "ts"

Let us now check the “timestamp” for each data point

# checking timestamp
time(mytimeseries)
##         Qtr1    Qtr2    Qtr3    Qtr4
## 1956 1956.00 1956.25 1956.50 1956.75
## 1957 1957.00 1957.25 1957.50 1957.75
## 1958 1958.00 1958.25 1958.50 1958.75
## 1959 1959.00 1959.25 1959.50 1959.75
## 1960 1960.00 1960.25 1960.50 1960.75
## 1961 1961.00 1961.25 1961.50 1961.75
## 1962 1962.00 1962.25 1962.50 1962.75
## 1963 1963.00 1963.25 1963.50 1963.75
## 1964 1964.00 1964.25 1964.50 1964.75
## 1965 1965.00 1965.25 1965.50 1965.75
## 1966 1966.00 1966.25 1966.50 1966.75
## 1967 1967.00 1967.25 1967.50 1967.75
## 1968 1968.00 1968.25

We have 4 Quarters every year

2.1.2 Creating a Time-Series from Data-frame

Let us read in some time series data from an xlsx file using read_excel(), a function from the readxl package, and store the data as a ts object.

# Getting data
mydata = read_excel("exercise1.xlsx")

Let us look at the Top 6 Rows of the data

# top 6 rows
head(mydata , n=6)

There are four observations in 1981, indicating quarterly data with a frequency of four rows per year. The first observation or start date is Mar-81, the first of four rows for year 1981, indicating that March corresponds with the first period.

Let us convert it to a TS object by specifying start as 1981 and frequency as quarterly

# ts for class time series
# Data starts in 1981 - 4 observations/year (quarterly)
myts = ts(data = mydata[,2:4],         ## All columns starting from 2nd column
                  start = c(1981,1),   ## Start Year 1981 , Start Period 1st Quarter
                  frequency = 4)       ## Quarterly hence freq = 4
class(myts)
## [1] "mts"    "ts"     "matrix"

2.2 Visualizing Time-Series(ts) object

The first step in any data analysis task is to plot the data. Graphs enable you to visualize many features of the data, including patterns, unusual observations, changes over time, and relationships between variables. Just as the type of data determines which forecasting method to use, it also determines which graphs are appropriate.

You can use the autoplot() function to produce a time plot of the data, with or without facets, i.e. panels that display different subsets of the data.

example : autoplot(usnim_2002, facets = FALSE)

This plots a line for every time series in a single panel.

The ggplot2 package serves our purpose for creating stunning Time Series visualizations. The latest version of the forecast package integrates with ggplot2 to produce time series relevant plots.

A time series data set calls for a line chart. Since this sort of chart implies a certain order along a timeline, the timeline is usually on the x axis, i.e. the observations are plotted against the time of observation, with consecutive observations joined by straight lines.

Let us plot the data stored as myts using autoplot(), with and without faceting.

# Plot data with faceting
autoplot(myts, facets = TRUE)

autoplot(myts)

When we use faceting, each time series is displayed in its own panel, with the y-axis scaled to that individual series.

Without faceting, the y-axis has a constant scale, so the individual time series can be compared directly.

Let us look at the following three Time Series Data Sets (included in forecast package)

  • gold containing gold prices in US dollars

  • woolyrnq containing information on the production of woollen yarn in Australia

  • gas containing Australian gas production

# Plot the three series
autoplot(gold)

autoplot(woolyrnq)

autoplot(gas)

2.3 Outliers and Frequency of Time Series

We will use two more functions in this exercise, which.max() and frequency():

  • which.max() identifies the (smallest) index of the maximum value in a series.

  • frequency() returns the number of observations per unit time.

# Find the outlier in the gold series
goldoutlier <- which.max(gold)
goldoutlier
## [1] 770

Observation #770 is the outlier, as seen in the time series plot.

# Look at the seasonal frequencies of the three series
frequency(gold)
## [1] 1
frequency(woolyrnq)
## [1] 4
frequency(gas)
## [1] 12

Gold : a frequency of 1 means one observation per cycle, i.e. no seasonal period is set (the gold series contains daily prices stored as a non-seasonal series)

Wool Yarn : This is a QUARTERLY Time Series

Gas Production : This is a MONTHLY Time Series

2.4 Seasonal and Sub-Series Plot

Along with time plots, there are other useful ways of plotting data to emphasize seasonal patterns and show changes in these patterns over time.

  • A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. You can create one using the ggseasonplot() function the same way you do with autoplot().

  • An interesting variant of a season plot uses polar coordinates, where the time axis is circular rather than horizontal; to make one, simply add a polar argument and set it to TRUE.

  • A subseries plot comprises mini time plots for each season. Here, the mean for each season is shown as a blue horizontal line.

One way of splitting a time series is by using the window() function, which extracts a subset from the object x observed between the times start and end.

example : window(x, start = NULL, end = NULL).
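For instance, a hedged example on the ausbeer data used below (assuming the fpp2 package, loaded below, is available; the start and end values are arbitrary):

# Extract the quarterly beer production from 1992 through the end of 2006
window(ausbeer, start = 1992, end = c(2006, 4))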

Let us load the fpp2 package and use two of its datasets:

  • a10 contains monthly sales volumes for anti-diabetic drugs in Australia. In the plots, can you see which month has the highest sales volume each year? What is unusual about the results in March and April 2008?

  • ausbeer which contains quarterly beer production for Australia. What is happening to the beer production in Quarter 4?

These examples will help you to visualize these plots and understand how they can be useful.

# Load Library fpp2
library(fpp2)
# Create plots of the a10 data
autoplot(a10)

ggseasonplot(a10 , year.labels=TRUE, year.labels.left=TRUE) +
  ylab("$ million") +
  ggtitle("Seasonal plot: antidiabetic drug sales")

There is an increase in drug sales every year as indicated by the 1st plot.

There is a large jump in sales in January each year; these are probably sales in late December, as customers stockpile before the end of the calendar year, but they are not registered with the government until a week or two later.

# Produce a polar coordinate season plot for the a10 data
ggseasonplot(a10, polar = TRUE)

# Create plots of the ausbeer data
autoplot(ausbeer)

Production has steadily increased. Starting around 1980, it oscillates around 500.

Let us now restrict the ausbeer series to start at 1980 using the window function

beer = window(ausbeer , start = 1980)
frequency(beer)
## [1] 4
# Make plots of the beer data
autoplot(beer)

# Let us look at "subseries" plots of beer
ggsubseriesplot(beer)

Q1 : Mean Production ~ 450 {Blue Line} ; Q2 : Mean Production ~ 410 {Blue Line}

Q3 : Mean Production ~ 425 {Blue Line} ; Q4 : Mean Production ~ 525 {Blue Line}

Let us look at nottem which is a time series object containing average air temperatures at Nottingham Castle in degrees Fahrenheit for 20 years.

# Time series specific month plots
ggmonthplot(nottem)

In the month plot, each month has its own mini line chart, so you can see how that particular month develops over the years. We also get the mean for each month, denoted by the solid blue line.

Let us examine the AirPassengers dataset that ships with base R (the datasets package)

# examine the first rows
head(AirPassengers)
##      Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135
# examine characteristics of TS
print(class(AirPassengers))
## [1] "ts"
print(start(AirPassengers))
## [1] 1949    1
print(end(AirPassengers))
## [1] 1960   12
print(frequency(AirPassengers))
## [1] 12
summary(AirPassengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

This is a Time Series data with a Monthly Frequency starting at January-1949 and ending at December-1960

Maximum passengers : 622K ; Minimum Passengers : 104K

Let us plot the original data

# plot the Airpassengers TS
autoplot(AirPassengers)

Increasing trend year over year; seasonal variations exist (mid-year and end-of-year peaks).

ggseasonplot(AirPassengers)

ggmonthplot(AirPassengers)

ggseasonplot(AirPassengers, polar=TRUE) +
  ylab("Thousands") +
  ggtitle("Polar seasonal plot: Air Passenger Traffic")

June and July recorded the highest number of passengers every year, coinciding with the holiday season.

The volume of passengers has increased every year, indicating a strong upward trend.

Let us take one more example, the arrivals data set, which comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, the UK and the US.

We will use autoplot(), ggseasonplot() and ggsubseriesplot() to compare the differences between the arrivals from these four countries and spot any strange observations.

# Examine the first rows
head(arrivals)
##          Japan     NZ     UK     US
## 1981 Q1 14.763 49.140 45.266 32.316
## 1981 Q2  9.321 87.467 19.886 23.721
## 1981 Q3 10.166 85.841 24.839 24.533
## 1981 Q4 19.509 61.882 52.264 33.438
## 1982 Q1 17.117 42.045 53.636 33.527
## 1982 Q2 10.617 63.081 34.802 28.366
# Examine summary Statistics
summary(arrivals)
##      Japan               NZ               UK               US        
##  Min.   :  9.321   Min.   : 37.04   Min.   : 19.89   Min.   : 23.72  
##  1st Qu.: 74.135   1st Qu.: 96.52   1st Qu.: 53.89   1st Qu.: 63.95  
##  Median :135.461   Median :154.54   Median : 95.56   Median : 85.88  
##  Mean   :122.080   Mean   :170.59   Mean   :106.86   Mean   : 84.85  
##  3rd Qu.:176.752   3rd Qu.:228.60   3rd Qu.:128.13   3rd Qu.:108.98  
##  Max.   :227.641   Max.   :330.81   Max.   :269.29   Max.   :136.09
# Plot the multiple Time Series (with and without facets)
p1= autoplot(arrivals)
p2 = autoplot(arrivals , facets = TRUE)
p1+p2

# Subseries plots by country
p1 = ggsubseriesplot(arrivals[,1]) + ggtitle("Japanese Arrivals")
p2 = ggsubseriesplot(arrivals[,2]) + ggtitle("New Zealand Arrivals")
p3 = ggsubseriesplot(arrivals[,3]) + ggtitle("UK Arrivals")
p4 = ggsubseriesplot(arrivals[,4]) + ggtitle("USA Arrivals")

p1 + p2 + p3 + p4

  • Japan : Arrivals into Australia increased until the late 1990s and then began to drop around the turn of the millennium.

  • New Zealand : Arrivals into Australia are increasing year on year, with the highest volumes occurring since 2005.

  • UK & USA : Arrivals into Australia increased until around 2000 and appear to have stagnated since.

4 Autocorrelation

4.1 Autocorrelation of non-seasonal time series

Another way to look at time series data is to plot each observation against another observation that occurred some time previously by using gglagplot(). For example, you could plot yt against yt−1. This is called a lag plot because you are plotting the time series against lags of itself.

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

There are several autocorrelation coefficients, corresponding to each panel in the lag plot. For example, r1 measures the relationship between yt and yt−1, r2 measures the relationship between yt and yt−2, and so on.

The correlations associated with the lag plots form what is called the autocorrelation function (ACF). The ggAcf() function produces an ACF plot, sometimes known as a correlogram.

Let us look at Annual oil production (millions of tonnes), Saudi Arabia, 1965-2013

# Create an autoplot of the oil data
autoplot(oil)

Let us plot the relationship between yt and yt−k, k=1,…,9 using gglagplot()

# Create a lag plot of oil data
gglagplot(oil)

Let us plot the correlations between yt and yt−k, k=1,…,9 using ggAcf()

# Create an ACF plot of the oil data
ggAcf(oil)

In this graph:

  • Looking at the lag plot, the strongest correlation is exhibited at lag 1.

  • Looking at the ACF plot, r1 is higher than the correlations at the other lags, and the correlations decrease slowly as the lag increases; this reflects the trend in the data and the absence of any seasonal pattern.

  • The dashed blue lines indicate whether the correlations are significantly different from zero.

4.2 Autocorrelation of seasonal and cyclic time series

When data are seasonal, the ACF will peak around the seasonal lags; when data are cyclic, the ACF will peak around the average cycle length.

Let us investigate this phenomenon by plotting the annual sunspot series (which follows the solar cycle of approximately 10-11 years) in sunspot.year and the daily traffic to the Hyndsight blog (which follows a 7-day weekly pattern) in hyndsight

# Plot the annual sunspot numbers
p1 <- autoplot(sunspot.year) + labs(title="Annual Sunspots")
p2 <- gglagplot(sunspot.year , lags = 20) + labs(title="Sunspots Lags")
p3 <- ggAcf(sunspot.year) + labs(title="Sunspots Correlogram")
p1

p2

p3

# Extract the autocorrelation coefficients
acf(sunspot.year , plot = FALSE)
## 
## Autocorrelations of series 'sunspot.year', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000  0.814  0.447  0.043 -0.262 -0.408 -0.361 -0.158  0.141  0.436  0.607 
##     11     12     13     14     15     16     17     18     19     20     21 
##  0.604  0.435  0.168 -0.094 -0.281 -0.346 -0.298 -0.149  0.053  0.246  0.371 
##     22     23     24 
##  0.380  0.265  0.065

Examining the three plots, we infer:

  1. The lag plot shows the strongest correlations at lags 1, 10, 11, 20 and 21.

  2. The ACF plot confirms this, with peaks around lags 1, 10-11 and 20-21, consistent with the roughly 10-11 year solar cycle.

# Plot the daily traffic to the Hyndsight blog
p1 <- autoplot(hyndsight) + labs(title="Daily Traffic")
p2 <- gglagplot(hyndsight , lags = 8) + labs(title="Daily Traffic Lags")
p3 <- ggAcf(hyndsight) + labs(title="Daily Traffic Correlogram")
p1 

p2 + p3

# Extract the autocorrelation coefficients
acf(hyndsight , plot = FALSE)
## 
## Autocorrelations of series 'hyndsight', by lag
## 
##  0.000  0.143  0.286  0.429  0.571  0.714  0.857  1.000  1.143  1.286  1.429 
##  1.000  0.651  0.231  0.030  0.001  0.193  0.544  0.734  0.532  0.164 -0.016 
##  1.571  1.714  1.857  2.000  2.143  2.286  2.429  2.571  2.714  2.857  3.000 
## -0.018  0.141  0.468  0.660  0.460  0.116 -0.045 -0.034  0.125  0.456  0.631 
##  3.143  3.286  3.429  3.571 
##  0.421  0.094 -0.071 -0.070

Examining the three plots, we infer:

  1. The strongest correlation occurs at lag 7, i.e. one week (the printed lags are in weeks, so a lag of 7 days appears as 1.000).

  2. The ACF plot shows peaks at lags 7, 14 and 21 (1, 2 and 3 weeks), reflecting the weekly seasonal pattern.

4.3 Autocorrelation of Trended and Seasonal Time Series

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.

When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.

When data are both trended and seasonal, you see a combination of these effects.

Let us examine the “Monthly Anti-diabetic Drug Sales” series

p1 <- autoplot(a10) + labs(title="Australian antidiabetic drug sales")

p2 <- ggAcf(a10) + labs(title="Correlogram of Australian antidiabetic drug sales")

p1 / p2

This shows both trend and seasonality. The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality.

4.4 White Noise

Time series that show no autocorrelation are called white noise. White noise consists of random, independent and identically distributed observations; in shorthand, statisticians often call this “iid data”. In other words, there is nothing going on: no trend, no seasonality, no cyclicity, not even any autocorrelation. Just randomness. It is a very important type of time series because it is the basis of almost all forecasting models.

The autocorrelation function of white noise consists of many insignificant spikes. Because the data is simply random, we expect correlations between observations to be close to zero. The dashed blue lines are there to show us how large a spike has to be before we can consider it significantly different from zero.

Each autocorrelation should be close to zero, and about 95% of the autocorrelations should lie within the blue lines. For a white noise series of length T, these bounds are drawn at approximately ±2/√T; for example, a series of 36 observations gives bounds of about ±0.33.

# wn is not defined earlier; assume a simulated white noise series
set.seed(3)   # arbitrary seed; the exact spikes described below vary with it
wn <- ts(rnorm(36))
# Correlogram of White Noise
ggAcf(wn) +
ggtitle("Sample ACF for white noise")

In this example, the first 15 spikes are all within the blue lines, as we would expect. Even the largest spike at lag 10 is well within the range we would expect for a white noise series. The blue lines are based on the sampling distribution for autocorrelation assuming the data are white noise. Any spike within the blue lines should be ignored. Spikes outside the blue lines might indicate something interesting in the data. At least, they suggest there may be some information that we could use in building a forecasting model.

To Test if a Time-Series is “White Noise or Not” we use the Ljung-Box Test.

The Ljung-Box test considers the first “h” autocorrelation values together.

A significant test (small p-value) indicates the data are probably not white noise.

If we don’t have white noise, look at the ACF to see which spikes are the most significant.
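A minimal sketch of this test in base R, using Box.test() from the stats package on the simulated wn series above (the choice of h = 24 lags is arbitrary):

# Ljung-Box test: a large p-value means we cannot reject white noise
Box.test(wn, lag = 24, fitdf = 0, type = "Ljung-Box")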
