Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.
The predictability of an event or a quantity depends on several factors including:
how well we understand the factors that contribute to it;
how much data is available;
whether the forecasts can affect the thing we are trying to forecast.
For example, forecasts of electricity demand can be highly accurate because all three conditions are usually satisfied. We have a good idea of the contributing factors: electricity demand is driven largely by temperatures, with smaller effects for calendar variation such as holidays, and economic conditions.
On the other hand, when forecasting currency exchange rates, only one of the conditions is satisfied: there is plenty of available data. However, we have a limited understanding of the factors that affect exchange rates, and forecasts of the exchange rate have a direct effect on the rates themselves. If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling.
Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. Good forecasts capture the genuine patterns and relationships that exist in the historical data, but do not replicate past events that will not occur again.
Forecasting situations vary widely in their time horizons, factors determining actual outcomes, types of data patterns, and many other aspects. Forecasting methods can be simple, such as using the most recent observation as a forecast (which is called the naïve method), or highly complex, such as neural nets and econometric systems of simultaneous equations. Sometimes, there will be no data available at all. For example, we may wish to forecast the sales of a new product in its first year, but there are obviously no data to work with. In situations like this, we use judgmental forecasting. The choice of method depends on what data are available and the predictability of the quantity to be forecast.
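As a minimal sketch of the naïve method mentioned above, the base R function below repeats the most recent observation for every step of the forecast horizon. The names naive_forecast and h (number of periods ahead) and the sales figures are illustrative assumptions, not taken from any package or data set.

```r
# Naive forecast: every future value is set to the last observed value.
# `naive_forecast` and `h` are hypothetical names used only for illustration.
naive_forecast <- function(y, h = 1) {
  y <- as.numeric(y)
  stopifnot(length(y) > 0, h >= 1)
  rep(y[length(y)], times = h)
}

sales <- c(120, 135, 128, 142)   # made-up observations
naive_forecast(sales, h = 3)     # 142 142 142
```

The forecast package loaded later in this post provides naive() for the same purpose, along with prediction intervals.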
Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning. However, business forecasting is often done poorly, and is frequently confused with planning and goals. They are three different things.
Forecasting is about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts.
Goals are what you would like to have happen. Goals should be linked to forecasts and plans, but this does not always occur. Too often, goals are set without any plan for how to achieve them, and no forecasts for whether they are realistic.
Planning is a response to forecasts and goals. Planning involves determining the appropriate actions that are required to make your forecasts match your goals.
Short-term forecasts are needed for the scheduling of personnel, production and transportation. As part of the scheduling process, forecasts of demand are often also required.
Medium-term forecasts are needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.
Long-term forecasts are used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.
Examples of time series data include:
Annual Google profits
Quarterly sales results for Amazon
Monthly rainfall
Weekly retail sales
Daily IBM stock prices
Hourly electricity demand
A univariate time series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals. One difference from standard linear regression is that the data are not necessarily independent and not necessarily identically distributed. One defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is very important because there is dependency between observations, and changing the order could change the meaning of the data.
The basic objective usually is to determine a model that describes the pattern of the time series. Uses for such a model are:
To describe the important features of the time series pattern.
To explain how the past affects the future or how two time series can “interact”.
To forecast future values of the series.
To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations.
There are two basic types of “time domain” models.
Models that relate the present value of a series to past values and past prediction errors - these are called ARIMA models (for Autoregressive Integrated Moving Average).
Ordinary regression models that use time indices as x-variables. These can be helpful for an initial description of the data and form the basis of several simple forecasting methods.
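The second type of model, a regression on a time index, can be sketched in a few lines of base R. This is only an illustration under simple assumptions: it uses the AirPassengers series that ships with R, fits a straight line, and ignores seasonality. The names trend_data, trend_fit and trend_forecast are made up for this example.

```r
# Regress a series on its time index to describe a linear trend (base R only).
y <- AirPassengers                      # monthly series shipped with R
trend_data <- data.frame(
  passengers = as.numeric(y),
  time_index = as.numeric(time(y))      # decimal years, e.g. 1949.000, 1949.083, ...
)
trend_fit <- lm(passengers ~ time_index, data = trend_data)
coef(trend_fit)["time_index"]           # estimated change in passengers per year

# Extrapolate the fitted line twelve months past the end of the data.
future <- data.frame(time_index = max(trend_data$time_index) + (1:12) / 12)
trend_forecast <- predict(trend_fit, newdata = future)
```

A straight line is a deliberately crude choice here; it describes the overall direction of the series but not its seasonal pattern.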
Some important questions to consider when first looking at a time series are:
Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?
Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?
Are there outliers? In regression, outliers are far away from your line. With time series data, your outliers are far away from your other data.
Is there a long-run cycle or period unrelated to seasonality factors?
Is there constant variance over time, or is the variance non-constant?
Are there any abrupt changes to either the level of the series or the variance?
Predictor variables are often useful in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form
$\text{ED} = f(\text{current temperature}, \text{strength of economy}, \text{population}, \text{time of day}, \text{day of week}, \text{error})$.
The relationship is not exact — there will always be changes in electricity demand that cannot be accounted for by the predictor variables. The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model. We call this an explanatory model because it helps explain what causes the variation in electricity demand.
Because the electricity demand data form a time series, we could also use a time series model for forecasting. In this case, a suitable time series forecasting equation is of the form
$\text{ED}_{t+1} = f(\text{ED}_t, \text{ED}_{t-1}, \text{ED}_{t-2}, \text{ED}_{t-3}, \dots, \text{error})$, where $t$ is the present hour, $t+1$ is the next hour, $t-1$ is the previous hour, and so on.
Here, prediction of the future is based on past values of a variable, but not on external variables which may affect the system. Again, the “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model.
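To make the idea of forecasting from past values concrete, here is a hedged base R sketch. It assumes a linear form for f with only two lags, and it uses a simulated series rather than real electricity demand; the names demand, lag_data, lag_fit and next_value are illustrative.

```r
# Sketch: forecast the next value of a series from its own recent past.
# The demand series is simulated; it is NOT real electricity data.
set.seed(123)
demand <- 50 + as.numeric(arima.sim(model = list(ar = 0.7), n = 200))

# embed() lines up each value with its lags: column 1 is y_t,
# column 2 is y_{t-1}, column 3 is y_{t-2}.
lagged <- embed(demand, dimension = 3)
lag_data <- data.frame(y = lagged[, 1], lag1 = lagged[, 2], lag2 = lagged[, 3])

# A linear choice for f(): regress the current value on the two previous ones.
lag_fit <- lm(y ~ lag1 + lag2, data = lag_data)

# One-step-ahead forecast from the two most recent observations.
n <- length(demand)
next_value <- predict(lag_fit,
                      newdata = data.frame(lag1 = demand[n], lag2 = demand[n - 1]))
```

The residuals of lag_fit play the role of the “error” term: they absorb random variation and anything the lagged values do not capture.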
Problem Definition : This step requires an understanding of the way the forecasts will be used, who requires the forecasts, and how the forecasting function fits within the organisation requiring the forecasts.
Gathering Information : Two kinds of information are required: (a) statistical data, and (b) the accumulated expertise of the people who collect the data and use the forecasts. Often, it will be difficult to obtain enough historical data to be able to fit a good statistical model. In that case, judgmental forecasting methods can be used. Occasionally, old data will be less useful due to structural changes in the system being forecast; then we may choose to use only the most recent data.
Preliminary Analysis (EDA) : Here we start by graphing the data and answering the following questions —
Are there consistent patterns?
Is there a significant trend?
Is seasonality important? Is there evidence of the presence of business cycles?
Are there any outliers in the data that need to be explained by those with expert knowledge?
How strong are the relationships among the variables available for analysis?
Choosing and Fitting Models : The model to use depends on the availability of historical data, the strength of relationships between the forecast variable and any explanatory variables, and the way in which the forecasts are to be used. These models include regression models, exponential smoothing methods, Box-Jenkins ARIMA models and several advanced methods including neural networks and vector autoregression.
Using and Evaluating Forecasting Models : Once a model has been selected and its parameters estimated, the model is used to make forecasts. A number of measures have been developed to help assess forecasts and compare models. Accuracy measures include Root Mean Square Error (RMSE) and Mean Absolute Error (MAE); model selection criteria include Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
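The two accuracy measures named above are simple to compute by hand. The sketch below defines them in base R; the function names rmse and mae and the numbers are made up for illustration.

```r
# Root Mean Square Error: penalises large errors more heavily.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean Absolute Error: average size of the errors, in the units of the data.
mae <- function(actual, predicted) mean(abs(actual - predicted))

actual    <- c(100, 110, 120, 130)   # made-up observed values
predicted <- c(102, 108, 123, 126)   # made-up forecasts

rmse(actual, predicted)   # sqrt(8.25), about 2.87
mae(actual, predicted)    # 2.75
```

Both measures should be computed on data that were not used to fit the model, otherwise they tend to flatter it.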
If you have to manage dates and times in any analytics role, it might get confusing pretty fast. Things like time zones, formats, leap years or leap seconds need to be considered. Depending on which type of task you want to perform with your date and time data, you have a lot of different tools available. We will take a look at standard tools in base R and then dive deeper into lubridate, which is probably the most convenient way to handle this sort of data.
Main use cases of lubridate related to Time Series include :
Date and Time based calculations
Advanced format conversions
Changing time zone
Checking for leap years
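For comparison, the same four tasks can be sketched with base R alone. The helper is_leap_year and the example dates are assumptions made for this illustration; lubridate offers leap_year() and with_tz() for the last two tasks.

```r
# Base R date handling; no extra packages are needed.
d  <- as.Date("1993-11-23")                        # ISO format is parsed by default
d2 <- as.Date("23/11/1993", format = "%d/%m/%Y")   # custom input format
d == d2                                            # TRUE: both are the same date

# Date-based calculations
d + 30                                             # "1993-12-23"
as.numeric(difftime(as.Date("1994-01-01"), d, units = "days"))  # 39

# Format conversion
format(d, "%d/%m/%Y")                              # "23/11/1993"

# Changing time zone: same instant, displayed in another zone
noon_utc <- as.POSIXct("1993-11-23 12:00:00", tz = "UTC")
format(noon_utc, tz = "Australia/Sydney", usetz = TRUE)

# Leap year check (hypothetical helper implementing the Gregorian rule)
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}
is_leap_year(c(1900, 1996, 2000))                  # FALSE TRUE TRUE
```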
# import library
library(lubridate)
Let us parse a few dates using the ymd() family of functions, which take a vector as an argument. These functions transform dates stored in character and numeric vectors into Date or POSIXct objects. They recognize arbitrary non-digit separators as well as no separator. As long as the order of the date components is correct, they will parse dates correctly even when the input vector contains differently formatted dates.
# Stored as Year-Month-Day
x = ymd(19931123)
print(x)
## [1] "1993-11-23"
print(class(x))
## [1] "Date"
# Stored as Day-Month-Year
y = dmy(23111993)
print(y)
## [1] "1993-11-23"
print(class(y))
## [1] "Date"
# Stored as Month-Day-Year
z = mdy(11231993)
print(z)
## [1] "1993-11-23"
print(class(z))
## [1] "Date"
The data for a time series is stored in an R object called a time-series object. It is an R data object, like a vector or data frame. The time series object is created by using the ts() function.
While you can have data containing dates and corresponding values in an R object of any other class, such as a data frame, creating objects of ts class offers many benefits, such as the time index information. Also, when you plot a ts object, it automatically creates a plot over time.
Syntax : timeseries.object.name <- ts(data, start, end, frequency)
The function ts() is usually called with three arguments (end is optional and is worked out from the length of the data when omitted):
data is set to everything in the data frame / vector except for the date column; the date column isn’t needed since the ts object stores time information separately.
start indicates the time of the first observation (the first available value of the first cycle).
frequency indicates the periodicity of the observations (the number of observations per cycle).
The value of the frequency parameter in the ts() function sets the number of observations per cycle. A value of 12 indicates monthly observations within a yearly cycle. Other values and their meanings are listed below:
frequency = 12 pegs the data points for every month of a year.
frequency = 4 pegs the data points for every quarter of a year.
frequency = 6 pegs the data points for every 10 minutes of an hour.
frequency = 24*6 pegs the data points for every 10 minutes of a day.
frequency = 52 pegs the data points for every week of a 52-week year.
If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25). Similarly, data that are observed every minute might have an hourly seasonality (frequency=60), a daily seasonality (frequency=24×60=1440), a weekly seasonality (frequency=24×60×7=10080) and an annual seasonality (frequency=24×60×365.25=525960).
While specifying the start and frequency , care should be taken about the cycle and number of measurements per cycle.
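A small base R sketch shows how start and frequency interact. The data values are placeholders; only the time index matters here.

```r
# Monthly data whose first observation is March 2020:
# start = c(year, period within the year), frequency = observations per year.
monthly <- ts(1:24, start = c(2020, 3), frequency = 12)
start(monthly)        # 2020 3
end(monthly)          # 2022 2  (24 monthly observations end in February 2022)
frequency(monthly)    # 12
cycle(monthly)[1:3]   # 3 4 5: position of each observation within its year

# Daily data treated as having a weekly cycle: frequency = 7.
daily <- ts(rnorm(28), frequency = 7)
frequency(daily)      # 7
```

Getting start wrong shifts every observation to the wrong period, so it is worth checking start(), end() and cycle() right after creating the object.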
Here is a list of all the libraries we need for creating, manipulating and analyzing time series:
# load libraries
library(tidyverse) # meta package for manipulating and visualizing data frames
library(lubridate) # package for manipulating dates
library(readxl) # reading Excel files
library(forecast) # time series forecasting and time series linear models
library(fpp2) # data for "Forecasting: Principles and Practice"
library(patchwork) # combining several ggplot2 plots into one layout
Let us create a time series data set of 50 observations drawn at random from a uniform distribution with a minimum of 10 and a maximum of 45. Let us assume the data is quarterly revenue starting in the year 1956.
# Create a vector of 50 random observations, uniformly distributed between 10 and 45
mydata = runif(n = 50, min = 10, max = 45)
# Convert Vector to Time-Series "ts" class
# Data starts in 1956 - 4 observations/year (quarterly)
mytimeseries = ts(data = mydata,
                  start = 1956, frequency = 4)
Let us plot the Time Series
# Plot the Time Series
plot(mytimeseries)
Let us check the class to validate if it is TS or not
# checking class
class(mytimeseries)
## [1] "ts"
Let us now check the “timestamp” for each data point
# checking timestamp
time(mytimeseries)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 1956.00 1956.25 1956.50 1956.75
## 1957 1957.00 1957.25 1957.50 1957.75
## 1958 1958.00 1958.25 1958.50 1958.75
## 1959 1959.00 1959.25 1959.50 1959.75
## 1960 1960.00 1960.25 1960.50 1960.75
## 1961 1961.00 1961.25 1961.50 1961.75
## 1962 1962.00 1962.25 1962.50 1962.75
## 1963 1963.00 1963.25 1963.50 1963.75
## 1964 1964.00 1964.25 1964.50 1964.75
## 1965 1965.00 1965.25 1965.50 1965.75
## 1966 1966.00 1966.25 1966.50 1966.75
## 1967 1967.00 1967.25 1967.50 1967.75
## 1968 1968.00 1968.25
We have 4 Quarters every year
Let us read in some time series data from an xlsx file using read_excel(), a function from the readxl package, and store the data as a ts object
# Getting data
mydata = read_excel("exercise1.xlsx")
Let us look at the Top 6 Rows of the data
# top 6 rows
head(mydata , n=6)
There are four observations in 1981, indicating quarterly data with a frequency of four rows per year. The first observation or start date is Mar-81, the first of four rows for year 1981, indicating that March corresponds with the first period.
Let us convert it to a TS object by specifying start as 1981 and frequency as quarterly
# ts for class time series
# Data starts in 1981 - 4 observations/year (quarterly)
myts = ts(data = mydata[,2:4], ## Columns 2 to 4, i.e. everything except the date column
          start = c(1981,1), ## Start Year 1981 , Start Period 1st Quarter
          frequency = 4) ## Quarterly hence freq = 4
class(myts)
## [1] "mts" "ts" "matrix"
The first step in any data analysis task is to plot the data. Graphs enable you to visualize many features of the data, including patterns, unusual observations, changes over time, and relationships between variables. Just as the type of data determines which forecasting method to use, it also determines which graphs are appropriate.
You can use the autoplot() function to produce a time plot of the data with or without facets, or panels that display different subsets of data
example : autoplot(usnim_2002, facets = FALSE)
This will plot the “trend lines” for every time-series
The ggplot2 package serves our purpose for creating stunning Time Series visualizations. The latest version of the forecast package integrates ggplot2 to produce time series relevant plots.
A time series data set requires a line chart. Since this sort of chart implies a certain order along a timeline, the timeline is usually on the x axis, i.e. the observations are plotted against the time of observation, with consecutive observations joined by straight lines.
Let us plot the data you stored as myts using autoplot() with and without faceting.
# Plot data with faceting
autoplot(myts, facets = TRUE)
# Plot data without faceting
autoplot(myts)
When we use “faceting” , the Y-Axis is scaled by individual “Time Series Observations”
Without faceting , the Y-Axis has a constant Scale and individual Time Series can be compared
Let us look at the following three Time Series Data Sets (included in forecast package)
gold containing gold prices in US dollars
woolyrnq containing information on the production of woollen yarn in Australia
gas containing Australian gas production
# Plot the three series
autoplot(gold)
autoplot(woolyrnq)
autoplot(gas)
We will use two more functions in this exercise, which.max() and frequency()
– which.max() can be used to identify the smallest index of the maximum value.
– to find the number of observations per unit time, we use frequency()
# Find the outlier in the gold series
goldoutlier <- which.max(gold)
goldoutlier
## [1] 770
Observation #770 is the outlier as seen from the Time-Series Plot
# Look at the seasonal frequencies of the three series
frequency(gold)
## [1] 1
frequency(woolyrnq)
## [1] 4
frequency(gas)
## [1] 12
Gold : Frequency 1, i.e. no seasonal period is stored (the series holds daily gold prices)
Wool Yarn : This is a QUARTERLY Time Series
Gas Production : This is a MONTHLY Time Series
Along with time plots, there are other useful ways of plotting data to emphasize seasonal patterns and show changes in these patterns over time.
A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. You can create one using the ggseasonplot() function the same way you do with autoplot().
An interesting variant of a season plot uses polar coordinates, where the time axis is circular rather than horizontal; to make one, simply add a polar argument and set it to TRUE.
A subseries plot comprises mini time plots for each season. Here, the mean for each season is shown as a blue horizontal line.
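The per-season means drawn in a subseries plot can also be computed directly. This base R sketch uses the AirPassengers series that ships with R; the name monthly_means is an illustrative choice.

```r
# Mean of each calendar month across all years: the values a subseries
# plot would draw as horizontal lines.
y <- AirPassengers
monthly_means <- tapply(as.numeric(y), as.integer(cycle(y)), mean)
names(monthly_means) <- month.abb      # label positions 1..12 as Jan..Dec
round(monthly_means, 1)

# Month with the highest average value
names(which.max(monthly_means))
```

cycle() returns the position of each observation within its year, which is what makes the grouping by season possible.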
One way of splitting a time series is by using the window() function, which extracts a subset from the object x observed between the times start and end
example : window(x, start = NULL, end = NULL).
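As a brief sketch of window() in use, the base R example below subsets the AirPassengers series that ships with R. The object names are illustrative.

```r
# Keep only January 1958 to December 1960 (3 years of monthly data).
ap_recent <- window(AirPassengers, start = c(1958, 1), end = c(1960, 12))
start(ap_recent)     # 1958 1
end(ap_recent)       # 1960 12
length(ap_recent)    # 36

# Leaving out `end` keeps everything from `start` to the last observation.
ap_from_1955 <- window(AirPassengers, start = 1955)
length(ap_from_1955) # 72
```

start and end accept either a single number (a year) or a c(year, period) pair, in the same way as ts().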
Let us load the fpp2 package and use two of its datasets:
a10 contains monthly sales volumes for anti-diabetic drugs in Australia. In the plots, can you see which month has the highest sales volume each year? What is unusual about the results in March and April 2008?
ausbeer contains quarterly beer production for Australia. What is happening to the beer production in Quarter 4?
These examples will help you to visualize these plots and understand how they can be useful
# Load Library fpp2
library(fpp2)
# Create plots of the a10 data
autoplot(a10)
ggseasonplot(a10 , year.labels=TRUE, year.labels.left=TRUE) +
  ylab("$ million") +
  ggtitle("Seasonal plot: antidiabetic drug sales")
There is an increase in drug sales every year as indicated by the 1st plot.
There is a large jump in sales in January each year. These are probably sales made in late December, as customers stockpile before the end of the calendar year, but not registered with the government until a week or two later.
# Produce a polar coordinate season plot for the a10 data
ggseasonplot(a10, polar = TRUE)
# Create plots of the ausbeer data
autoplot(ausbeer)
Production increased steadily in the early years. From about 1980, it oscillates around 500.
Let us now restrict the ausbeer series to start at 1980 using the window function
beer = window(ausbeer , start = 1980)
frequency(beer)
## [1] 4
# Make plots of the beer data
autoplot(beer)
# Let us look at "subseries" plots of beer
ggsubseriesplot(beer)
Q1 : Mean Production ~ 450 {Blue Line} ; Q2 : Mean Production ~ 410 {Blue Line}
Q3 : Mean Production ~ 425 {Blue Line} ; Q4 : Mean Production ~ 525 {Blue Line}
Let us look at nottem which is a time series object containing average air temperatures at Nottingham Castle in degrees Fahrenheit for 20 years.
# Time series specific month plots
ggmonthplot(nottem)
In the month plot, each month has its own line chart, so you can see how that particular month develops over the years. We also get the mean for each month, denoted by the solid blue line.
Let us examine the AirPassengers dataset, which ships with base R (in the datasets package)
# examine the first 6 observations
head(AirPassengers)
## Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135
# examine characteristics of TS
print(class(AirPassengers))
## [1] "ts"
print(start(AirPassengers))
## [1] 1949 1
print(end(AirPassengers))
## [1] 1960 12
print(frequency(AirPassengers))
## [1] 12
summary(AirPassengers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
This is a Time Series data with a Monthly Frequency starting at January-1949 and ending at December-1960
Maximum passengers : 622K ; Minimum Passengers : 104K
Let us plot the original data
# plot the AirPassengers TS
autoplot(AirPassengers)
There is an increasing trend year over year, and seasonal variations exist (mid year and end of the year).
ggseasonplot(AirPassengers)
ggmonthplot(AirPassengers)
ggseasonplot(AirPassengers, polar=TRUE) +
  ylab("Thousands") +
  ggtitle("Polar seasonal plot: Air Passenger Traffic")
July and August record the highest number of passengers every year (holiday season)
The volume of passengers has increased every year (strong upward trend)
Let us take one more example, the arrivals data set, which comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.
We will use autoplot(), ggseasonplot() and ggsubseriesplot() to compare the differences between the arrivals from these four countries and spot any strange observations.
# Examine the first 6 rows
head(arrivals)
## Japan NZ UK US
## 1981 Q1 14.763 49.140 45.266 32.316
## 1981 Q2 9.321 87.467 19.886 23.721
## 1981 Q3 10.166 85.841 24.839 24.533
## 1981 Q4 19.509 61.882 52.264 33.438
## 1982 Q1 17.117 42.045 53.636 33.527
## 1982 Q2 10.617 63.081 34.802 28.366
# Examine summary Statistics
summary(arrivals)
## Japan NZ UK US
## Min. : 9.321 Min. : 37.04 Min. : 19.89 Min. : 23.72
## 1st Qu.: 74.135 1st Qu.: 96.52 1st Qu.: 53.89 1st Qu.: 63.95
## Median :135.461 Median :154.54 Median : 95.56 Median : 85.88
## Mean :122.080 Mean :170.59 Mean :106.86 Mean : 84.85
## 3rd Qu.:176.752 3rd Qu.:228.60 3rd Qu.:128.13 3rd Qu.:108.98
## Max. :227.641 Max. :330.81 Max. :269.29 Max. :136.09
# Plot seasonal Time Series (with and without facets)
p1 = autoplot(arrivals)
p2 = autoplot(arrivals , facets = TRUE)
p1+p2
# Subseries plots by country
p1 = ggsubseriesplot(arrivals[,1]) + ggtitle("Japanese Arrivals")
p2 = ggsubseriesplot(arrivals[,2]) + ggtitle("New Zealand Arrivals")
p3 = ggsubseriesplot(arrivals[,3]) + ggtitle("UK Arrivals")
p4 = ggsubseriesplot(arrivals[,4]) + ggtitle("USA Arrivals")
p1 + p2 + p3 + p4
Japan : Arrivals into Australia increased until the late 1990s and then began to drop from the turn of the millennium
New Zealand : Arrivals into Australia are increasing year on year, with the highest volume of passengers since 2005.
UK & USA : Arrivals into Australia increased until around 2000, after which they appear to stagnate
Some time series patterns occur so frequently that we give them names.
TREND : A “trend” occurs when there is a long-term increase or decrease in the data. This is deliberately a little vague as a trend is not a well-defined mathematical term. But if we talk about a trend we mean a general tendency for the time series to go up over time, or down over time.
SEASONALITY : A “Seasonality” occurs when there is a regular pattern in the time series related to the calendar. For example, a yearly pattern, or a weekly pattern or a daily pattern. Whenever the behavior of a time series is influenced in a periodic manner by the calendar, we call it seasonal.
CYCLICITY : Seasonality should be distinguished from cyclic patterns. Cyclicity occurs when there are rises and falls that are not of a fixed period. For example, a business cycle might last 3 or 5 or 8 years between peaks or troughs. But a seasonal pattern is always of the same length.
We need to distinguish between seasonal and cyclic patterns as very different time series models are used in each case.
Seasonal patterns have constant length, while cyclic patterns have variable length.
If both exist together, then the average length of the cyclic pattern is longer than the length of the seasonal pattern.
The size of the cycles tends to be more variable than the size of the seasonal fluctuations. As a result, it is much harder to predict cyclic data than seasonal data.
Let us examine some trends
Monthly Australian electricity production
# Monthly Australian Electricity Production (the elec series from fpp2)
autoplot(elec)
It is clearly trended, with a change in the slope of the trend around 1990. It is also seasonal. Notice how the seasonal pattern changes a little over time with a little more volatility in the trough at the end of this period than at the beginning. There is no cyclic behavior visible in this graph.
Quarterly Australian clay brick production
# Quarterly Australian clay brick production
autoplot(bricksq)
This shows both seasonality and cyclicity. The seasonality is seen by the small bumps, one each year. The cyclicity is seen by the longer ups and downs. For example, there was a recession in 1975, and another one in 1983, and then again in 1991. Between these recessions the series rises and falls. There is also some trend seen in this graph, particularly in the first half.
Annual Canadian Lynx trappings 1821-1934
# Annual Canadian Lynx trappings 1821-1934
autoplot(lynx)
Because this is annual data, it cannot be seasonal. The population of lynx rises when there is plenty of food, and when the food supply gets low, they stop breeding, causing the population to plummet. The surviving lynx then have plenty of food, start to breed again, and the cycle continues. The length of these cycles varies between 8 and 11 years. This is also a good example to show how variable the magnitude of cyclic patterns can be, with the largest peak being more than three times the size of the smallest peak.
Another way to look at time series data is to plot each observation against another observation that occurred some time previously by using gglagplot(). For example, you could plot $y_t$ against $y_{t-1}$. This is called a lag plot because you are plotting the time series against lags of itself.
Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.
There are several autocorrelation coefficients, corresponding to each panel in the lag plot. For example, $r_1$ measures the relationship between $y_t$ and $y_{t-1}$, $r_2$ measures the relationship between $y_t$ and $y_{t-2}$, and so on.
The correlations associated with the lag plots form what is called the autocorrelation function (ACF). The ggAcf() function produces ACF plots, sometimes known as correlograms.
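The autocorrelation coefficients can be computed by hand to see what the ACF plot is showing. The sketch below uses the standard sample formula $r_k = \sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y}) \big/ \sum_{t=1}^{T} (y_t - \bar{y})^2$ and compares it with base R's acf() on the lynx series that ships with R. The function name autocorrelation is an illustrative choice.

```r
# Sample autocorrelation at lag k, written out from the formula above.
autocorrelation <- function(y, k) {
  y <- as.numeric(y)
  n <- length(y)
  y_bar <- mean(y)
  numerator <- sum((y[(k + 1):n] - y_bar) * (y[1:(n - k)] - y_bar))
  numerator / sum((y - y_bar)^2)
}

# Compare lags 1 to 3 with the values returned by acf().
r_manual  <- sapply(1:3, function(k) autocorrelation(lynx, k))
r_builtin <- acf(lynx, lag.max = 3, plot = FALSE)$acf[2:4]  # element 1 is lag 0
all.equal(r_manual, r_builtin)   # TRUE
```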
Let us look at Annual oil production (millions of tonnes), Saudi Arabia, 1965-2013
# Create an autoplot of the oil data
autoplot(oil)
Let us plot the relationship between $y_t$ and $y_{t-k}$, $k = 1, \dots, 9$ using gglagplot()
# Create a lag plot of oil data
gglagplot(oil)
Let us plot the correlations between $y_t$ and $y_{t-k}$, $k = 1, \dots, 9$ using ggAcf()
# Create an ACF plot of the oil data
ggAcf(oil)
In this graph:
Looking at the lag plot , strongest correlation is exhibited at “Lag=1”.
Looking at ACF Plot , r1 is higher than for the other lags. This is due to the absence of seasonal pattern in the data
The dashed blue lines indicate whether the correlations are significantly different from zero
When data are either seasonal or cyclic, the ACF will peak around the seasonal lags or ACF will peak at the average cycle length.
Let us investigate this phenomenon by plotting the annual sunspot series (which follows the solar cycle of approximately 10-11 years) in sunspot.year and the daily traffic to the Hyndsight blog (which follows a 7-day weekly pattern) in hyndsight
# Plot the annual sunspot numbers
p1 <- autoplot(sunspot.year) + labs(title="Annual Sunspots")
p2 <- gglagplot(sunspot.year , lags = 20) + labs(title="Sunspots Lags")
p3 <- ggAcf(sunspot.year) + labs(title="Sunspots Correlogram")
# Combine the three plots with patchwork (side by side)
p1 | p2 | p3
# Extract the Autocorrelation Coefficients
acf(sunspot.year , plot = FALSE)
##
## Autocorrelations of series 'sunspot.year', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.814 0.447 0.043 -0.262 -0.408 -0.361 -0.158 0.141 0.436 0.607
## 11 12 13 14 15 16 17 18 19 20 21
## 0.604 0.435 0.168 -0.094 -0.281 -0.346 -0.298 -0.149 0.053 0.246 0.371
## 22 23 24
## 0.380 0.265 0.065
Examining the three plots we infer :
Maximum Correlation occurs at “Lag 1 , Lag 10 , Lag 11 , Lag 20 , Lag 21”
ACF Plot shows Maximum Correlation “Lag 1 , Lag 10 , Lag 11 , Lag 20 , Lag 21”
# Plot the daily traffic to the Hyndsight blog
p1 <- autoplot(hyndsight) + labs(title="Daily Traffic")
p2 <- gglagplot(hyndsight , lags = 8) + labs(title="Daily Traffic Lags")
p3 <- ggAcf(hyndsight) + labs(title="Daily Traffic Correlogram")
# Combine the three plots with patchwork
p1 | p2 + p3
# Extract the Autocorrelation Coefficients
acf(hyndsight , plot = FALSE)
##
## Autocorrelations of series 'hyndsight', by lag
##
## 0.000 0.143 0.286 0.429 0.571 0.714 0.857 1.000 1.143 1.286 1.429
## 1.000 0.651 0.231 0.030 0.001 0.193 0.544 0.734 0.532 0.164 -0.016
## 1.571 1.714 1.857 2.000 2.143 2.286 2.429 2.571 2.714 2.857 3.000
## -0.018 0.141 0.468 0.660 0.460 0.116 -0.045 -0.034 0.125 0.456 0.631
## 3.143 3.286 3.429 3.571
## 0.421 0.094 -0.071 -0.070
Examining the three plots we infer :
Maximum Correlation occurs at “Lag 7”
ACF Plot shows Maximum Correlation “Lag 7 , Lag 14 , Lag 21”
When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.
When data are both trended and seasonal, you see a combination of these effects.
Let us examine the “Monthly Anti-diabetic Drug Sales”
p1 <- autoplot(a10) + labs(title="Australian antidiabetic drug sales")
p2 <- ggAcf(a10) + labs(title="Correlogram of Australian antidiabetic drug sales")
p1 / p2
This shows both trend and seasonality. The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality.
Time series that show no autocorrelation are called white noise. It is just random, independent and identically distributed observations. In short-hand, statisticians often call this “iid data”. In other words, there is nothing going on - no trend, no seasonality, no cyclicity, not even any auto-correlations. Just randomness. It is a very important type of time series because it is the basis of almost all forecasting models.
The autocorrelation function of white noise consists of many insignificant spikes. Because the data is simply random, we expect correlations between observations to be close to zero. The dashed blue lines are there to show us how large a spike has to be before we can consider it significantly different from zero.
Each autocorrelation is “close to zero”, and about 95% of the autocorrelations should lie within the blue lines
# Correlogram of White Noise
# Assumption: wn is not defined in the original, so a simulated white noise series
# stands in for it here; its spikes will differ from those described below.
wn <- ts(rnorm(36))
ggAcf(wn) +
  ggtitle("Sample ACF for white noise")
In this example, the first 15 spikes are all within the blue lines, as we would expect. Even the largest spike at lag 10 is well within the range we would expect for a white noise series. The blue lines are based on the sampling distribution for autocorrelation assuming the data are white noise. Any spike within the blue lines should be ignored. Spikes outside the blue lines might indicate something interesting in the data. At least, they suggest there may be some information that we could use in building a forecasting model.
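The position of the blue lines can be approximated by hand. The sketch below assumes the usual large-sample bound of about ±1.96/√T for a series of length T (often rounded to ±2/√T) and uses a freshly simulated series, white_noise, rather than the series plotted above. It is base R only and the object names are illustrative.

```r
# Approximate 95% bounds for the autocorrelations of a white noise series.
set.seed(42)
white_noise <- ts(rnorm(200))                 # simulated iid observations
n_obs <- length(white_noise)
bound <- qnorm(0.975) / sqrt(n_obs)           # about 1.96 / sqrt(T)

# Autocorrelations at lags 1 to 20 (drop lag 0, which is always 1).
r <- acf(white_noise, lag.max = 20, plot = FALSE)$acf[-1]

# Share of autocorrelations inside the bounds; for white noise we expect
# roughly 95%, so an occasional spike outside is not alarming.
share_inside <- mean(abs(r) < bound)
share_inside
```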
To Test if a Time-Series is “White Noise or Not” we use the Ljung-Box Test.
The Ljung-Box test considers the first “h” autocorrelation values together.
A significant test (small p-value) indicates the data are probably not white noise.
If we don’t have white noise, look at the ACF to see which spikes are the most significant.
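A hedged sketch of the Ljung-Box test using Box.test() from base R's stats package is shown below. The choice of lag = 10 (the “h” above) and the simulated series are assumptions made for illustration; AirPassengers ships with R.

```r
# Ljung-Box test on a simulated white noise series and on a series with
# obvious trend and seasonality.
set.seed(2024)
white_noise <- ts(rnorm(100))

lb_noise <- Box.test(white_noise, lag = 10, type = "Ljung-Box")
lb_air   <- Box.test(AirPassengers, lag = 10, type = "Ljung-Box")

lb_noise$p.value   # a large p-value is consistent with white noise
lb_air$p.value     # very small: AirPassengers is clearly not white noise
```

When the test is applied to the residuals of a fitted model, Box.test() has a fitdf argument to account for the number of estimated parameters.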
Hyndman, R.J. & Athanasopoulos, G., Forecasting: Principles and Practice, 2nd edition, https://otexts.com/fpp2
Udemy : Time Series Analysis and Forecasting in R : https://udemy.com/course/time-series-analysis-and-forecasting-in-r
Using R for Time Series Analysis https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
Applied Time Series Analysis STAT 510