1 Introduction

1.1 What can we forecast

Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.

The predictability of an event or a quantity depends on several factors including:

  1. how well we understand the factors that contribute to it;

  2. how much data is available;

  3. whether the forecasts can affect the thing we are trying to forecast.

For example, forecasts of electricity demand can be highly accurate because all three conditions are usually satisfied. We have a good idea of the contributing factors: electricity demand is driven largely by temperatures, with smaller effects for calendar variation such as holidays, and economic conditions.

On the other hand, when forecasting currency exchange rates, only one of the conditions is satisfied: there is plenty of available data. However, we have a limited understanding of the factors that affect exchange rates, and forecasts of the exchange rate have a direct effect on the rates themselves. If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling.

Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again

Forecasting situations vary widely in their time horizons, factors determining actual outcomes, types of data patterns, and many other aspects. Forecasting methods can be simple, such as using the most recent observation as a forecast (which is called the naïve method), or highly complex, such as neural nets and econometric systems of simultaneous equations. Sometimes, there will be no data available at all. For example, we may wish to forecast the sales of a new product in its first year, but there are obviously no data to work with. In situations like this, we use judgmental forecasting.The choice of method depends on what data are available and the predictability of the quantity to be forecast.

1.2 Forecasting, planning and goals

Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning. However, business forecasting is often done poorly, and is frequently confused with planning and goals. They are three different things.

Forecasting is about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts.

Goals are what you would like to have happen. Goals should be linked to forecasts and plans, but this does not always occur. Too often, goals are set without any plan for how to achieve them, and no forecasts for whether they are realistic.

Planning is a response to forecasts and goals. Planning involves determining the appropriate actions that are required to make your forecasts match your goals.

Short-term forecasts are needed for the scheduling of personnel, production and transportation. As part of the scheduling process, forecasts of demand are often also required

Medium-term forecasts are needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.

Long-term forecasts are used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.

Examples of time series data include:

  • Annual Google profits

  • Quarterly sales results for Amazon

  • Monthly rainfall

  • Weekly retail sales

  • Daily IBM stock prices

  • Hourly electricity demand

1.3 Fundamentals of Time Series

Univariate Time Series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals.One difference from standard linear regression is that the data are not necessarily independent and not necessarily identically distributed. One defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is very important because there is dependency and changing the order could change the meaning of the data

1.3.1 Basic Objectives of the Analysis

The basic objective usually is to determine a model that describes the pattern of the time series. Uses for such a model are:

  1. To describe the important features of the time series pattern.

  2. To explain how the past affects the future or how two time series can “interact”.

  3. To forecast future values of the series.

  4. To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations.

1.3.2 Types of Models

There are two basic types of “time domain” models.

  1. Models that relate the present value of a series to past values and past prediction errors - these are called ARIMA models (for Autoregressive Integrated Moving Average).

  2. Ordinary regression models that use time indices as x-variables. These can be helpful for an initial description of the data and form the basis of several simple forecasting methods.

1.3.3 Important Characteristics

Some important questions to first consider when first looking at a time series are:

  • Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?

  • Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?

  • Are there outliers? In regression, outliers are far away from your line. With time series data, your outliers are far away from your other data.

  • Is there a long-run cycle or period unrelated to seasonality factors?

  • Is there constant variance over time, or is the variance non-constant?

  • Are there any abrupt changes to either the level of the series or the variance?

1.3.4 Predictor variables and time series forecasting

Predictor variables are often useful in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form

ED=f(current temperature, strength of economy, population,time of day, day of week, error).

The relationship is not exact — there will always be changes in electricity demand that cannot be accounted for by the predictor variables. The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model. We call this an explanatory model because it helps explain what causes the variation in electricity demand.

Because the electricity demand data form a time series, we could also use a time series model for forecasting. In this case, a suitable time series forecasting equation is of the form

EDt+1=f(EDt,EDt−1,EDt−2,EDt−3,…,error) {t is the present hour, t+1 is the next hour, t−1 is the previous hour and so on. }

Here, prediction of the future is based on past values of a variable, but not on external variables which may affect the system. Again, the “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model.

1.3.5 Steps in Forecasting

  1. Problem Definition : This step requires an understanding of the way the forecasts will be used, who requires the forecasts, and how the forecasting function fits within the organisation requiring the forecasts

  2. Gathering Information : Two kinds of information required (a) statistical data, and (b) the accumulated expertise of the people who collect the data and use the forecasts. Often, it will be difficult to obtain enough historical data to be able to fit a good statistical model. In that case, the judgmental forecasting methods can be used. Occasionally, old data will be less useful due to structural changes in the system being forecast; then we may choose to use only the most recent data.

  3. Preliminary Analysis (EDA) : Here we start by graphing the data and answering the following questions —

    • Are there consistent patterns?

    • Is there a significant trend?

    • Is seasonality important? Is there evidence of the presence of business cycles?

    • Are there any outliers in the data that need to be explained by those with expert knowledge?

    • How strong are the relationships among the variables available for analysis?

  4. Choosing and Fitting Models : The model to use depends on the availability of historical data, the strength of relationships between the forecast variable and any explanatory variables, and the way in which the forecasts are to be used.These models include regression models , exponential smoothing methods , Box-Jenkins ARIMA models and several advanced methods including neural networks and vector autoregression.

  5. Using and Evaluating Forecasting Models : Once a model has been selected and its parameters estimated, the model is used to make forecasts.A number of methods have been developed to help in assessing the accuracy of forecasts some of which include Root Mean Square Error (RMSE) , Mean Absolute Error (MAE) , Akaike’s Information Criterion (AIC) , Bayesian Information Criterion (BIC) and so on…

2 Dates and Times in R

If you have to manage dates and times in any analytics role, it might get confusing pretty fast. Things like time zones, formats, leap year or leap seconds need to be considered.Depending on which type of task you want to perform with your date and time data, you have a lot of different tools available.We will take a look at standard tools in R base and then we dive deeper into lubricate, which is probably the most convenient way to handle this sort of data.

2.1 POSIXct , POSIXlt and Date Classes

Generally, there are several classes available that can handle dates and time, the one best known is POSIX t(Portable Operating System Interface for time.)This class essentially is an encoding system, which makes sure that the underlying time data is recognized by all standard computer operating systems.

The major class of POSIXt is divided into POSIXct and POSIXlt. Both are able to handle time zones, dates and time for you.

Let us compare these two classes

# POSIXt classes in R
x <- as.POSIXct("2019-12-25 11:45:34")
y <- as.POSIXlt("2019-12-25 11:45:34")
print(x)
## [1] "2019-12-25 11:45:34 IST"
print(y)
## [1] "2019-12-25 11:45:34 IST"
# unclass POSIXct
print(unclass(x))
## [1] 1577254534
## attr(,"tzone")
## [1] ""
# unclass POSIXlt
print(unclass(y))
## $sec
## [1] 34
## 
## $min
## [1] 45
## 
## $hour
## [1] 11
## 
## $mday
## [1] 25
## 
## $mon
## [1] 11
## 
## $year
## [1] 119
## 
## $wday
## [1] 3
## 
## $yday
## [1] 358
## 
## $isdst
## [1] 0
## 
## $zone
## [1] "IST"
## 
## $gmtoff
## [1] NA

While POSIXct gives us a simple number, POSIXlt consists of a whole list of aliments.

The number in ct is an amount of seconds. It has a reference point at the zero second of the first day of 1970 (01-01-1970 00:00:00) and R calculates the amount of seconds from that time point onward.The number 1,577,254,534 implies that ~1.5 Billion seconds have passed since the Birth Date of Time 01-01-1970 00:00:00

POSIXlt on the other hand, consists of all those elements which you can conveniently extract with a dollar sign in typical our style, like, for example, the time zone over here, extraction

# Extracting the Time Zone
y$zone
## [1] "IST"

Related to these POSIXt classes, we have several other classes to handle dates and time.A very important one is the date class and this one has again a reference time at the beginning of 1970.But in this case, it calculates in days instead of seconds.

# Date Class
a = as.Date("2019-12-25")
print(a)
## [1] "2019-12-25"
print(class(a))
## [1] "Date"
print(unclass(a))
## [1] 18255

On un-classing (a) it gives us again a simple number, which in this case is 18,255 days since 1970.

To summarize :

  • as.POSIXct passed seconds since the birth of second (01-01-1970 00:00:00)

  • as.POSIXlt gives date and time components

  • Both POSIXct and POSIXlt account for date , time and time zones

  • as.Date passed days since the birth day (01-01-1970)

2.2 Format Conversion : Strings to Date/Time strptime()

When we are importing date and time data in a data frame, this is usually done as character, at least that is what are will recognize it as this is due to the fact that date and time comes in so many different varieties that it is, in fact hard to automatically recognize it.Therefore, you actually need a tool to convert it from character into a date time column.Historically, in R this has been done with strptime()

The key inputs to strptime() function are :

  • x : A vector, which is mostly a column in a data frame.

  • format : A character string which provides the information on how the character should be converted into date time.The default for the format methods is “%Y-%m-%d %H:%M:%S”.With this format, you tell, R the order of the years, months, days, hours, minutes and seconds into character.But not only that, you also tell our how these components look like and how they are separated.Refer strptime() help for more details.

Let’s say we have a character vector “a” which has 3 values which are Dates but in string format

# Dates as Strings
a = as.character(c("1993-12-30 23:45",

                   "1994-11-05 11:43",

                   "1992-03-09 21:54"))
print(a)
## [1] "1993-12-30 23:45" "1994-11-05 11:43" "1992-03-09 21:54"
print(class(a))
## [1] "character"

To convert these to Date Class we will first specify the format in which we want the dates e.g. %Y for 4 Digit Year , %m for Decimal Month and %d for 2 digit Days and convert to POSIXt class

# Convert to desired POSIXt class format
b = strptime(a , format = "%Y-%m-%d %H:%M")

print(class(b))
## [1] "POSIXlt" "POSIXt"

To Summarize :

  • strptime() converts strings to date time and vice versa.

  • Strings (Vectors / Columns) to be converted need to be in a uniform format.

  • Review the format of existing date values before conversion

2.3 Manipulating Dates : Lubridate

Main use cases of lubridate related to Time Series include :

  • Date and Time based calculations

  • Advanced format conversions

  • Changing time zone

  • Checking for leap years

# import library
library(lubridate)

Let us parse a few dates applying the ymd method which take a vector as an argument.The ymd class of functions tansform dates stored in character and numeric vectors to Date or POSIXct objects.These functions recognize arbitrary non-digit separators as well as no separator. As long as the order of formats is correct, these functions will parse dates correctly even when the input vectors contain differently formatted dates.

2.3.0.1 Converting to POSIXt Class

# Stored as Year-Month-Day
x = ymd(19931123)
print(x)
## [1] "1993-11-23"
print(class(x))
## [1] "Date"
# Stored as Day-Month-Year
y = dmy(23111993)
print(y)
## [1] "1993-11-23"
print(class(y))
## [1] "Date"
# Stored as Month-Day-Year
z = mdy(11231993)
print(z)
## [1] "1993-11-23"
print(class(z))
## [1] "Date"

2.3.0.2 Extracting Features after Conversion

Once the Date column is converted to Date Class one can extract sub elements aka minute , month , day , year and so on…

mytimepoint <- ymd_hm("1993-11-23 11:23", tz = "Europe/Prague")

mytimepoint
## [1] "1993-11-23 11:23:00 CET"
# Extract minute
minute(mytimepoint)
## [1] 23
# Extract day
day(mytimepoint)
## [1] 23
# Extract month
month(mytimepoint)
## [1] 11
# Extract year
year(mytimepoint)
## [1] 1993
# Extract WeekDay
wday(mytimepoint , label = TRUE , abbr = FALSE)
## [1] Tuesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Let us understand the lubridate concepts through a simple example where we will create a Dataframe that has “Date” , “Time” and “Measurement” columns that will look like the one below. The time colum is CET time zone based.

We will start by creating individual vectors for date and time which has data in different formats

# Vector of dates
a = c("1998,11,11", "1983/01/23", "1982:09:04", "1945-05-09", 19821224, "1974.12.03", 19871210)
a
## [1] "1998,11,11" "1983/01/23" "1982:09:04" "1945-05-09" "19821224"  
## [6] "1974.12.03" "19871210"
# vector of time
b = c("22 4 5", "04;09;45", "11:9:56", "23,15,12", "14 16 34", "8 8 23", "21 16 14")
b
## [1] "22 4 5"   "04;09;45" "11:9:56"  "23,15,12" "14 16 34" "8 8 23"   "21 16 14"

We will now convert the date vector Date Class with time zone CET

# Convert date vector to date class
a = ymd(a , tz = "CET")
print(a)
## [1] "1998-11-11 CET"  "1983-01-23 CET"  "1982-09-04 CEST" "1945-05-09 CEST"
## [5] "1982-12-24 CET"  "1974-12-03 CET"  "1987-12-10 CET"
print(class(a))
## [1] "POSIXct" "POSIXt"

We will now convert the time vector to Hour Minute Second

# Convert time vector to hms
b = hms(b)
print(b)
## [1] "22H 4M 5S"   "4H 9M 45S"   "11H 9M 56S"  "23H 15M 12S" "14H 16M 34S"
## [6] "8H 8M 23S"   "21H 16M 14S"
print(class(b))
## [1] "Period"
## attr(,"package")
## [1] "lubridate"

Now create the 3rd column with 7 random numbers between with mean 10

# create the measurement vector with mean 10
f = rnorm(7,10) 
# round the digits to 2 decimals
f = round(f, digits = 2)
# print the vector
print(f)
## [1]  7.55 11.41  9.88 10.71 11.39 10.47 10.13

We will now combine all 3 vectors into one data frame using cbind.data.frame method and add the column names

# Create the dataframe
mydf = cbind.data.frame(Date = a , Time = b , Measurement = f)
mydf

3 Time Series Data Pre-processing and Visualization

3.1 Creation of TimeSeries(ts) object

The data for the time series is stored in an R object called time-series object. It is also a R data object like a vector or data frame. The time series object is created by using the ts() function.

While you can have data containing dates and corresponding values in an R object of any other class such as a dataframe, creating objects of ts class offers many benefits such as the time index information. Also, when you plot a ts object, it automatically creates a plot over time

Syntax : timeseries.object.name <- ts(data, start, end, frequency)

Following is the description of the parameters used − except the parameter “data” all other parameters are optional.

  • data is a vector or matrix containing the values used in the time series.

  • start specifies the start time for the first observation in time series.

  • end specifies the end time for the last observation in time series.

  • frequency specifies the number of observations per unit time per cycle.

The value of the frequency parameter in the ts() function decides the time intervals at which the data points are measured. A value of 12 indicates that the time series is for 12 months. Other values and its meaning is as below −

  • frequency = 12 pegs the data points for every month of a year.

  • frequency = 4 pegs the data points for every quarter of a year.

  • frequency = 6 pegs the data points for every 10 minutes of an hour.

  • frequency = 24*6 pegs the data points for every 10 minutes of a day.

  • frequency = 52 pegs the data points for every week for a 52 week period

If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25). Similarly, data that are observed every minute might have an hourly seasonality (frequency=60), a daily seasonality (frequency=24×60=1440), a weekly seasonality (frequency=24×60×7=10080) and an annual seasonality (frequency=24×60×365.25=525960)

While specifying the start and frequency , care should be taken about the cycle and number of measurements per cycle.

Let us create a fictitious time series dataset

# Getting data
mydata = runif(n = 50, min = 10, max = 45)

Let us convert it to a TS object by specifying start as 1956 and frequency as quarterly

# ts for class time series
# Data starts in 1956 - 4 observations/year (quarterly)
mytimeseries = ts(data = mydata, 
                  start = 1956, frequency = 4)
class(mytimeseries)
## [1] "ts"

Let us plot our time series

# Lets see how the data looks
plot(mytimeseries)

# Checking the class
class(mytimeseries)
## [1] "ts"
# Checking the timestamp
time(mytimeseries)
##         Qtr1    Qtr2    Qtr3    Qtr4
## 1956 1956.00 1956.25 1956.50 1956.75
## 1957 1957.00 1957.25 1957.50 1957.75
## 1958 1958.00 1958.25 1958.50 1958.75
## 1959 1959.00 1959.25 1959.50 1959.75
## 1960 1960.00 1960.25 1960.50 1960.75
## 1961 1961.00 1961.25 1961.50 1961.75
## 1962 1962.00 1962.25 1962.50 1962.75
## 1963 1963.00 1963.25 1963.50 1963.75
## 1964 1964.00 1964.25 1964.50 1964.75
## 1965 1965.00 1965.25 1965.50 1965.75
## 1966 1966.00 1966.25 1966.50 1966.75
## 1967 1967.00 1967.25 1967.50 1967.75
## 1968 1968.00 1968.25

Let us take another example . We will use a dataset that consists of monthly prices of the S&P Composite Index for the last three years starting from July 2014 to June 2017. We load the original data in a data vector called sp_vector and plot it.

# create sp_vector
sp_vector <- c(1973.1,1961.53,1993.23,1937.27,2044.57,2054.27,2028.18,2082.2,2079.99,2094.86,2111.94,2099.29,2094.14,2039.87,1944.41,2024.81,2080.62,2054.08,1918.6,1904.42,2021.95,2075.54,2065.55,2083.89,2148.9,2170.95,2157.69,2143.02,2164.99,2246.63,2275.12,2329.91,2366.82,2359.31,2395.35,2433.99)

# plot sp_vector
plot(sp_vector)

As you can see, the plot does not contain any time index information.

We can use the ts() function to convert this vector into a time series object. Our data set contains monthly stock prices from July 2017 to June 2017. There are 12 observations per year starting from July 2014. So, we will set the frequency to 12 and the set start argument to c(2014,7) where the first value is the year and the second value is the month index (July)

# convert to time series object 
sp_ts <- ts(sp_vector,start=c(2014,7),frequency=12)

# plot the time series
plot(sp_ts)

Modifying the start period of Time Series

We can modify the start time of the Time Series by specifying start=c(strt_yr , srt_mth) when we are creating the time series object

# Refining the start argument
mytimeseries2 = ts(data = mydata, 
                  start = c(1956,3), frequency = 4)
# plot the refined time series
plot(mytimeseries2)

3.2 Visualizing Time Series

A significant part of statistics is the visualization of the results, after all, you want to communicate your findings to a broader audience whenever you do this, make it under the assumption that the people concerned have no or only a limited statistical understanding.That means the visualizations should be simple and in no case, overloaded.

Graphs enable many features of the data to be visualised, including patterns, unusual observations, changes over time, and relationships between variables. The features that are seen in plots of the data must then be incorporated, as much as possible, into the forecasting methods to be used.While base R is good for plotting Time Series data , it needs additional tuning to create meaningful Time Series Visualizations. The ggplot2 package serves our purpose for creating stunning Time Series visualizations.The latest version of the forecast package integrates ggplot2 to produce time series relevant plots.

A time series data set requires a line chart.Since this sort of chart implies a certain order along a timeline, the timeline is usually on the x axis i.e. the observations are plotted against the time of observation, with consecutive observations joined by straight lines.

Let us look at standard TS plots in base-R

# Standard R Base plots
plot(nottem) 

If we want to see the components of our Time series decomposed one by one, we can apply the plot function decompose to our TS and visualize the components trend, seasonal and random with the original data set on top

# Plot of components
plot(decompose(nottem))

Let us do the same visualizations with forecast package which has a neat “autoplot” function

# load the library
library(forecast)
# The ggplot equivalent to plot
forecast::autoplot(nottem)

The fpp2 package created by Rob Hyndmann is a versatile package for analyzing Time Series Data…

Let us look at a few more TS plots…

library(fpp2)
# weekly economy passenger load on Ansett Airlines between 
# Australia’s two largest cities (melsyd is the dataset)
autoplot(melsyd[,"Economy.Class"]) +
  ggtitle("Economy class passengers: Melbourne-Sydney") +
  xlab("Year") +
  ylab("Thousands")

# Monthly sales of antidiabetic drugs in Australia.
autoplot(a10) +
  ggtitle("Antidiabetic drug sales") +
  ylab("$ million") +
  xlab("Year")

Seasonal and Sub-Series Plots

A seasonal plot ggseason plot()is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed.There is also an option to plot monthly trends using monthplot().

Note : For seasonal plots , there needs to be inherent seasonality in the Time Series

# Time series specific plots
ggseasonplot(nottem) 

We notice that each year is represented by one line.The corresponding month is available on the X axis midway. it is quite clear which months are the cold ones and which ones are warmer.

# Antidiabetic drug sales in Austraila
ggseasonplot(a10, year.labels=TRUE, year.labels.left=TRUE) +
  ylab("$ million") +
  ggtitle("Seasonal plot: antidiabetic drug sales")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

There is a large jump in sales in January each year , these are probably sales in late December as customers stockpile before the end of the calendar year, but not registered with the government until a week or two later.

# Time series specific plots
ggmonthplot(nottem)

In the monthplot() , each month has its own line and you can see how this particular month develops over the years i.e. each month has its own line chart. We also get the mean for each month denoted by the solid blue line.

A useful variation on the seasonal plot uses polar coordinates. Setting polar=TRUE makes the time series axis circular rather than horizontal, as shown below

ggseasonplot(a10, polar=TRUE) +
  ylab("$ million") +
  ggtitle("Polar seasonal plot: antidiabetic drug sales")

Let us examine the Airpassengers dataset that is part of the forecast package

# examine top 5 rows
head(AirPassengers)
##      Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135
# examine characteristics of TS
print(class(AirPassengers))
## [1] "ts"
print(start(AirPassengers))
## [1] 1949    1
print(end(AirPassengers))
## [1] 1960   12
print(frequency(AirPassengers))
## [1] 12
summary(AirPassengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

This is a Time Series data with a Monthly Frequency starting at January-1940 and ending at December-1960

Maximum passengers : 622K ; Minimum Passengers : 104K

Let us plot the original data

# plot the Airpassengers TS
autoplot(AirPassengers)

Increasing Trends Year of Year …. Seasonal variations exist (Mid Year , End of the Year)….

ggseasonplot(AirPassengers)

ggmonthplot(AirPassengers)

ggseasonplot(AirPassengers, polar=TRUE) +
  ylab("Thousands") +
  ggtitle("Polar seasonal plot: Air Passenger Traffic")

June and July had recorded the highest number of passengers every year — Holiday Season

The volume of passengers has increased every year — Strong upward trend

Let us look at another dataset from Rob Hyndman’s book “Forecasting Principles and Practice” tute.csv.

The dataset has four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation

# read the data
tute = read.csv("tute.csv" , header = TRUE)
# examine the structure
str(tute)
## 'data.frame':    100 obs. of  4 variables:
##  $ X       : chr  "Mar-81" "Jun-81" "Sep-81" "Dec-81" ...
##  $ Sales   : num  1020 889 795 1004 1058 ...
##  $ AdBudget: num  659 589 512 614 647 ...
##  $ GDP     : num  252 291 291 292 279 ...

The 1st column contains the quarters as we don’t need them now. We will omit this column when converting to TS

# convert to a time seties
mytimeseries <- ts(tute[,-1], start=1981, frequency=4)
class(mytimeseries)
## [1] "mts"    "ts"     "matrix"

This is a “Multivariate Time Series” since we have multiple columns

Let us construct Time Series plots for each of the 4 Series with and without the argument facets = TRUE

library(patchwork)
# basic autoplot time series plot
p1 = autoplot(mytimeseries)

# faceted autoplot time series plot
p2 = autoplot(mytimeseries , facets = TRUE)

p1 + p2

When we add the facet argument , individual time series are plotted with respective y-scale limits.When we omit the facet argument , all three series share the same y-scale limits.

Let us take one more example of the arrivals data set which comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.

We will use autoplot(), ggseasonplot() and ggsubseriesplot() to compare the differences between the arrivals from these four countries and spot any strange observations.

# Examine top 5 rows
head(arrivals)
##          Japan     NZ     UK     US
## 1981 Q1 14.763 49.140 45.266 32.316
## 1981 Q2  9.321 87.467 19.886 23.721
## 1981 Q3 10.166 85.841 24.839 24.533
## 1981 Q4 19.509 61.882 52.264 33.438
## 1982 Q1 17.117 42.045 53.636 33.527
## 1982 Q2 10.617 63.081 34.802 28.366
# Examine summary Statistics
summary(arrivals)
##      Japan               NZ               UK               US        
##  Min.   :  9.321   Min.   : 37.04   Min.   : 19.89   Min.   : 23.72  
##  1st Qu.: 74.135   1st Qu.: 96.52   1st Qu.: 53.89   1st Qu.: 63.95  
##  Median :135.461   Median :154.54   Median : 95.56   Median : 85.88  
##  Mean   :122.080   Mean   :170.59   Mean   :106.86   Mean   : 84.85  
##  3rd Qu.:176.752   3rd Qu.:228.60   3rd Qu.:128.13   3rd Qu.:108.98  
##  Max.   :227.641   Max.   :330.81   Max.   :269.29   Max.   :136.09
# Plot seasonal Time Series (with and without facets)
p1= autoplot(arrivals)
p2 = autoplot(arrivals , facets = TRUE)
p1+p2

# Subseries plots by country
p1 = ggsubseriesplot(arrivals[,1]) + ggtitle("Japanese Arrivals")
p2 = ggsubseriesplot(arrivals[,2]) + ggtitle("New Zealand Arrivals")
p3 = ggsubseriesplot(arrivals[,3]) + ggtitle("UK Arrivals")
p4 = ggsubseriesplot(arrivals[,4]) + ggtitle("USA Arrivals")

p1 + p2 + p3 + p4

Inference :

  • Japan : Arrivals into Australia increased till late 90s and then began to drop starting the turn of the millennium

  • New Zeland : Arrivals into Australia are increasing year on year with highest volume of passengers since 2005.

  • UK & USA : Arrivals into Australia increased @ 2000 and then on there seems to be a stagnation

3.3 Missing data in Time Series

Missing data can arise for many reasons, and it is worth considering whether the missingness will induce bias in the forecasting model. For example, suppose we are studying sales data for a store, and missing values occur on public holidays when the store is closed. The following day may have increased sales as a result. If we fail to allow for this in our forecasting model, we will most likely under-estimate sales on the first day after the public holiday, but over-estimate sales on the days after that

In other situations, the missingness may be essentially random. For example, someone may have forgotten to record the sales figures, or the data recording device may have malfunctioned. If the timing of the missing data is not informative for the forecasting problem, then the missing values can be handled more easily

3.3.1 Missing Data Imputation

The imputeTS package specializes on univariate time series imputation. It offers multiple state-of-the-art imputation algorithm implementations along with plotting functions for time series missing data statistics.

https://cran.r-project.org/package=imputeTS

When missing values cause errors, there are at least two ways to handle the problem. First, we could just take the section of data after the last missing value, assuming there is a long enough series of observations to produce meaningful forecasts. Alternatively, we could replace the missing values with estimates.

For non-seasonal data , simple linear interpolation is used to fill in the missing sections. For seasonal data, an STL decomposition is used to estimate the seasonal component, and the seasonally adjusted series are linear interpolated.

Alternative methods include Imputation by Last Observation Carried Forward (forward-fill) , Imputation by Next Observation Carried Backward (Backward-fill) , Seasonally Decomposed Missing Value Imputation etc…

Let us understand these concepts by analyzing two cases…

Case 1 : Sensor Readings

# load data
mydata = read.csv("ts_na_ol.csv")
str(mydata)
## 'data.frame':    250 obs. of  2 variables:
##  $ X     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mydata: num  32.8 42.5 NA 32.2 55.6 ...
# Convert to Time series only "mydata column"
myts = ts(mydata$mydata)
str(myts)
##  Time-Series [1:250] from 1 to 250: 32.8 42.5 NA 32.2 55.6 ...
# Load library
library(imputeTS)
# plot missing values
ggplot_na_distribution(myts)

# Missing statistics
statsNA(myts)
## [1] "Length of time series:"
## [1] 250
## [1] "-------------------------"
## [1] "Number of Missing Values:"
## [1] 5
## [1] "-------------------------"
## [1] "Percentage of Missing Values:"
## [1] "2%"
## [1] "-------------------------"
## [1] "Number of Gaps:"
## [1] 5
## [1] "-------------------------"
## [1] "Average Gap Size:"
## [1] 1
## [1] "-------------------------"
## [1] "Stats for Bins"
## [1] "  Bin 1 (63 values from 1 to 63) :      3 NAs (4.76%)"
## [1] "  Bin 2 (63 values from 64 to 126) :      0 NAs (0%)"
## [1] "  Bin 3 (63 values from 127 to 189) :      1 NAs (1.59%)"
## [1] "  Bin 4 (61 values from 190 to 250) :      1 NAs (1.64%)"
## [1] "-------------------------"
## [1] "Longest NA gap (series of consecutive NAs)"
## [1] "1 in a row"
## [1] "-------------------------"
## [1] "Most frequent gap size (series of consecutive NA series)"
## [1] "1 NA in a row (occurring 5 times)"
## [1] "-------------------------"
## [1] "Gap size accounting for most NAs"
## [1] "1 NA in a row (occurring 5 times, making up for overall 5 NAs)"
## [1] "-------------------------"
## [1] "Overview NA series"
## [1] "  1 NA in a row: 5 times"

Overall 2% missing values , with data missing at irregular intervals….

# plot original time series
autoplot(myts)

Impute missing values with interpolation and plot the Time Series

my_imputed_ts <- na_interpolation(myts)
ggplot_na_imputations(myts, my_imputed_ts)

Red dots indicate “Imputed Missing Values”… Blue dots are the existing values

Case 2 : Time series of monthly airline passengers (with NAs) The tsAirgap time series has 144 rows aand includes 14 NA values

# Visualize Missing Distribution
ggplot_na_distribution(tsAirgap)

# Visualize Missing % across intervals
ggplot_na_intervals(tsAirgap)

# Impute missing and Visualize
imp <- na_interpolation(tsAirgap)
ggplot_na_imputations(tsAirgap, imp)

3.4 Outliers in Time Series

Outliers are observations that are very different from the majority of the observations in the time series. They may be errors, or they may simply be unusual.We may wish to replace them with missing values, or with an estimate that is more consistent with the majority of the data.Simply replacing outliers without thinking about why they have occurred is a dangerous practice. They may provide useful information about the process that produced the data, and which should be taken into account when forecasting.

3.4.1 Outlier Imputation

Outliers may indicate anomalies e.g. Occurrence Fraudulent Transactions , Spike in Blood Glucose Levels , Exceptional Events impacting stock prices etc…If we assume that the outliers are genuinely errors, or that they won’t occur in the forecasting period, then replacing them can make the forecasting task easier.

The tsoutliers() function from the forecast package is designed to identify outliers, and to suggest potential replacement values

Let us examine outliers in our sensor Time series (myts)

# plot the time series
autoplot(myts)

There are peaks which correspond to outliers

# examine outliers
tsoutliers(myts)
## $index
## [1]  45  98 172 211
## 
## $replacements
## [1] 28.41242 39.78616 33.59838 37.96883

We have 4 Outlier s and suggested replacements identified

The function tsclean() identifies and replaces outliers, and also replaces missing values.

# Cleaning NA and outliers with forecast package
mytsclean = tsclean(myts)

# Summary Statistics
print(summary(mytsclean))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.683  28.157  34.567  35.025  41.830  64.633
# Plot the Original and Cleaned Timeseries
p1 = autoplot(myts)

p2 = autoplot(mytsclean)

p1 + p2

4 References