# load libraries
library(tidyverse) # meta package for manipulating and visualizing data frames
library(lubridate) # package for manipulating dates
library(readxl)    # reading excel files
library(forecast)  # meta package for time series forecasting & linear models
library(fpp2)      # data sets for "Forecasting: Principles and Practice"
library(patchwork) # combining ggplot2 plots
library(ggplot2)   # grammar of graphics plotting
library(xts)       # extensible time series objects
library(directlabels) # labeling plots

1 Introduction

Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.

The predictability of an event or a quantity depends on several factors including:

  1. how well we understand the factors that contribute to it;

  2. how much data is available;

  3. whether the forecasts can affect the thing we are trying to forecast.

For example, forecasts of electricity demand can be highly accurate because all three conditions are usually satisfied. We have a good idea of the contributing factors: electricity demand is driven largely by temperatures, with smaller effects for calendar variation such as holidays, and economic conditions.

On the other hand, when forecasting currency exchange rates, only one of the conditions is satisfied: there is plenty of available data. However, we have a limited understanding of the factors that affect exchange rates, and forecasts of the exchange rate have a direct effect on the rates themselves. If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling.

Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

Forecasting situations vary widely in their time horizons, factors determining actual outcomes, types of data patterns, and many other aspects. Forecasting methods can be simple, such as using the most recent observation as a forecast (which is called the naïve method), or highly complex, such as neural nets and econometric systems of simultaneous equations. Sometimes, there will be no data available at all. For example, we may wish to forecast the sales of a new product in its first year, but there are obviously no data to work with. In situations like this, we use judgmental forecasting. The choice of method depends on what data are available and the predictability of the quantity to be forecast.
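
As a brief, hedged illustration (not part of the original text), the naïve method described above is available as naive() in the forecast package loaded at the top of this document; the sketch below applies it to a small simulated series, so the data and object names are made up.

# A minimal sketch of the naive method on a simulated series: the forecast
# for every future period is simply the last observed value
set.seed(123)
toy_series <- ts(cumsum(rnorm(24)), start = c(2020, 1), frequency = 12)
naive_fc   <- naive(toy_series, h = 6)   # forecast the next 6 periods
autoplot(naive_fc)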

Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning. However, business forecasting is often done poorly, and is frequently confused with planning and goals. They are three different things.

Forecasting is about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts.

Goals are what you would like to have happen. Goals should be linked to forecasts and plans, but this does not always occur. Too often, goals are set without any plan for how to achieve them, and no forecasts for whether they are realistic.

Planning is a response to forecasts and goals. Planning involves determining the appropriate actions that are required to make your forecasts match your goals.

Short-term forecasts are needed for the scheduling of personnel, production and transportation. As part of the scheduling process, forecasts of demand are often also required

Medium-term forecasts are needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.

Long-term forecasts are used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.

Examples of time series data include:

  • Annual Google profits

  • Quarterly sales results for Amazon

  • Monthly rainfall

  • Weekly retail sales

  • Daily IBM stock prices

  • Hourly electricity demand


2 Time Series Basics

A univariate time series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals. One difference from standard linear regression is that the data are not necessarily independent and not necessarily identically distributed. One defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is very important because there is dependency, and changing the order could change the meaning of the data.

2.0.1 Objectives of the Analysis

The basic objective usually is to determine a model that describes the pattern of the time series. Uses for such a model are:

  1. To describe the important features of the time series pattern.

  2. To explain how the past affects the future or how two time series can “interact”.

  3. To forecast future values of the series.

  4. To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations.

2.0.2 Types of Models

There are two basic types of “time domain” models.

  1. Models that relate the present value of a series to past values and past prediction errors - these are called ARIMA models (for Autoregressive Integrated Moving Average).

  2. Ordinary regression models that use time indices as x-variables. These can be helpful for an initial description of the data and form the basis of several simple forecasting methods.

2.0.3 Characteristics of Time Series

Some important questions to consider when first looking at a time series are:

  • Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?

  • Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?

  • Are there outliers? In regression, outliers are far away from your line. With time series data, your outliers are far away from your other data.

  • Is there a long-run cycle or period unrelated to seasonality factors?

  • Is there constant variance over time, or is the variance non-constant?

  • Are there any abrupt changes to either the level of the series or the variance?

2.0.4 Predictor variables and time series forecasting

Predictor variables are often useful in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form

ED = f(current temperature, strength of economy, population, time of day, day of week, error).

The relationship is not exact — there will always be changes in electricity demand that cannot be accounted for by the predictor variables. The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model. We call this an explanatory model because it helps explain what causes the variation in electricity demand.

Because the electricity demand data form a time series, we could also use a time series model for forecasting. In this case, a suitable time series forecasting equation is of the form

ED_{t+1} = f(ED_t, ED_{t-1}, ED_{t-2}, ED_{t-3}, …, error), where t is the present hour, t+1 is the next hour, t-1 is the previous hour, and so on.

Here, prediction of the future is based on past values of a variable, but not on external variables which may affect the system. Again, the “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model.
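
To make the distinction concrete, here is a hedged sketch on simulated data (all variable names are made up) of an explanatory model with a predictor versus a pure time series model that uses only past values of the series.

# Hedged sketch: explanatory model vs. pure time series model on simulated data
set.seed(42)
n <- 200
temperature <- 25 + 5 * sin(2 * pi * (1:n) / 24) + rnorm(n)   # made-up predictor
demand      <- 100 + 3 * temperature + rnorm(n, sd = 5)        # made-up response

# Explanatory model: demand explained by a predictor variable
explanatory_fit <- lm(demand ~ temperature)

# Time series model: demand explained only by its own past values
demand_ts <- ts(demand, frequency = 24)
ts_fit    <- ar(demand_ts, order.max = 3)   # autoregressive model

summary(explanatory_fit)
ts_fit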

2.0.5 Steps in Forecasting

  1. Problem Definition : This step requires an understanding of the way the forecasts will be used, who requires the forecasts, and how the forecasting function fits within the organisation requiring the forecasts

  2. Gathering Information : Two kinds of information are required: (a) statistical data, and (b) the accumulated expertise of the people who collect the data and use the forecasts. Often, it will be difficult to obtain enough historical data to be able to fit a good statistical model. In that case, judgmental forecasting methods can be used. Occasionally, old data will be less useful due to structural changes in the system being forecast; then we may choose to use only the most recent data.

  3. Preliminary Analysis (EDA) : Here we start by graphing the data and answering the following questions —

    • Are there consistent patterns?

    • Is there a significant trend?

    • Is seasonality important? Is there evidence of the presence of business cycles?

    • Are there any outliers in the data that need to be explained by those with expert knowledge?

    • How strong are the relationships among the variables available for analysis?

  4. Choosing and Fitting Models : The model to use depends on the availability of historical data, the strength of relationships between the forecast variable and any explanatory variables, and the way in which the forecasts are to be used. These models include regression models, exponential smoothing methods, Box-Jenkins ARIMA models and several advanced methods including neural networks and vector autoregression.

  5. Using and Evaluating Forecasting Models : Once a model has been selected and its parameters estimated, the model is used to make forecasts. A number of methods have been developed to help assess the accuracy of forecasts, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC); a short illustration follows below.
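
As a hedged sketch of step 5 (not from the original text), the accuracy() function in the forecast package reports several of these error measures when forecasts are compared against a held-out test window; the split points below are arbitrary choices for the a10 series used later in this document.

# Hold out the final years of the a10 series, forecast them with a seasonal
# naive benchmark, and report error measures such as RMSE and MAE
train <- window(a10, end = c(2005, 12))
test  <- window(a10, start = c(2006, 1))

fc <- snaive(train, h = length(test))   # seasonal naive benchmark forecast
accuracy(fc, test)                      # RMSE, MAE and related measures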

3 Working with Dates : Lubridate

If you have to manage dates and times in any analytics role, it can get confusing pretty fast. Things like time zones, formats, leap years or leap seconds need to be considered. Depending on which type of task you want to perform with your date and time data, you have a lot of different tools available. We will take a look at standard tools in base R and then dive deeper into lubridate, which is probably the most convenient way to handle this sort of data.

Main use cases of lubridate related to Time Series include :

  • Date and Time based calculations

  • Advanced format conversions

  • Changing time zone

  • Checking for leap years

Creating a date-time object

There are three types of date/time data that refer to an instant in time:

  • A date. Tibbles print this as <date>.

  • A time within a day. Tibbles print this as <time>.

  • A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct

To get the current date or date-time you can use today() or now()

# today's date
today()
## [1] "2023-07-31"
# current time
now()
## [1] "2023-07-31 19:15:51 IST"

3.0.1 Creating date/times

There are three ways to create a date/time:

  • From a string.

  • From individual date-time components.

  • From an existing date/time object.

3.0.1.1 From Strings

Date/time data often come as strings. One approach is to parse a string into a date-time by specifying its format explicitly (for example with base R's as.Date() or strptime()).

Another approach is to use the helpers provided by lubridate that automatically work out the format once you specify the order of the components. To use them, we identify the order in which year, month, and day appear in our dates, then arrange "y", "m", and "d" in the same order.

Let’s look at some examples of converting a date stored as a string into a date object applying the lubridate helpers

# Convert the string date to date object Year , Month , Day (helper = ymd)
my_date = ymd("2017-01-31")
print(my_date)
## [1] "2017-01-31"
print(class(my_date))
## [1] "Date"
# Convert the string date to date object Month Day Year (helper = mdy)
my_date_2 = mdy("January 31st, 2017")
print(my_date_2)
## [1] "2017-01-31"
print(class(my_date_2))
## [1] "Date"
# Convert the string date to date object Day Month Year (helper = dmy)
my_date_3 = dmy("31-Jan-2017")
print(my_date_3)
## [1] "2017-01-31"
class(my_date_3)
## [1] "Date"

The lubridate helpers can also take unquoted numbers and convert them to a date object

# Convert an unquoted number into a date (Year-Month-Day)
ymd(20170131)
## [1] "2017-01-31"

3.0.1.2 From Individual Components

Instead of a single string, sometimes we may have the individual components of the date-time spread across multiple columns. Let’s look at the flights data which is part of the nycflights13 package

library(nycflights13)
# flights data
head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We have the following columns :

year, month, day : Date of departure split into year , month and day

dep_time, arr_time : Actual departure and arrival times (format HHMM or HMM), local tz.

sched_dep_time, sched_arr_time : Scheduled departure and arrival times (format HHMM or HMM), local tz.

hour, minute : Time of scheduled departure broken into hour and minutes

We will make a date time object from this sort of input, using make_date() for dates, or make_datetime() for date-times

Let's extract only the year, month, day, hour and minute columns and make the date-time object dep_date_time using the make_datetime() function from lubridate

# create the departure time object
flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(dep_date_time = make_datetime(year, month, day, hour, minute)) %>% head()
## # A tibble: 6 × 6
##    year month   day  hour minute dep_date_time      
##   <int> <int> <int> <dbl>  <dbl> <dttm>             
## 1  2013     1     1     5     15 2013-01-01 05:15:00
## 2  2013     1     1     5     29 2013-01-01 05:29:00
## 3  2013     1     1     5     40 2013-01-01 05:40:00
## 4  2013     1     1     5     45 2013-01-01 05:45:00
## 5  2013     1     1     6      0 2013-01-01 06:00:00
## 6  2013     1     1     5     58 2013-01-01 05:58:00

Because we have constructed the departure time using hours and minutes, the object is of class POSIXct (a date-time).
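
As a quick illustrative check (an addition to the original text), we can confirm the class of such a constructed date-time directly

# confirming the class of a date-time built with make_datetime()
class(make_datetime(2013, 1, 1, 5, 15))   # "POSIXct" "POSIXt"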

Let’s create a departure date only column combining year , month and day

# create a departure day column
flights %>% select(year, month, day) %>%
  mutate(dep_day = make_date(year, month, day)) %>% head()
## # A tibble: 6 × 4
##    year month   day dep_day   
##   <int> <int> <int> <date>    
## 1  2013     1     1 2013-01-01
## 2  2013     1     1 2013-01-01
## 3  2013     1     1 2013-01-01
## 4  2013     1     1 2013-01-01
## 5  2013     1     1 2013-01-01
## 6  2013     1     1 2013-01-01

3.0.1.3 From Other types

We may want to switch between a date-time and a date. Here we apply the as_datetime() and as_date() functions

# as date_time
as_datetime(today())
## [1] "2023-07-31 UTC"
# as date
as_date(now())
## [1] "2023-07-31"

3.0.1.4 Exercises

  • What happens if you parse a string that contains invalid dates? For example: ymd(c("2010-10-10", "bananas"))
ymd(c("2010-10-10", "bananas"))
## [1] "2010-10-10" NA

The second element ("bananas") is not recognized as a date and is converted to a missing value (NA).

  • Use the appropriate lubridate function to parse each of the following dates
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
# solution 1
print(d1)
## [1] "January 1, 2010"
date_d1 <- mdy(d1)
print(date_d1)
## [1] "2010-01-01"
print(class(date_d1))
## [1] "Date"
# solution 2
print(d2)
## [1] "2015-Mar-07"
date_d2 <- ymd(d2)
print(date_d2)
## [1] "2015-03-07"
print(class(date_d2))
## [1] "Date"
# solution 3
print(d3)
## [1] "06-Jun-2017"
date_3 <- dmy(d3)
print(date_3)
## [1] "2017-06-06"
print(class(date_3))
## [1] "Date"
# solution 4
print(d4)
## [1] "August 19 (2015)" "July 1 (2015)"
date_d4 <- mdy(d4)
print(date_d4)
## [1] "2015-08-19" "2015-07-01"
print(class(date_d4))
## [1] "Date"

3.0.2 Date-Time Components (Feature Engineering)

Now that we know how to get date-time data into R’s date-time data structures, let’s explore what we can do with them.

  • Using accessor functions that let us get and set individual components.

  • Look at how arithmetic works with date-times.
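
Here is a small sketch (an addition to the original text) previewing both ideas: setting a component of a date in place and doing simple arithmetic with lubridate's period helpers.

# Setting a component of a date in place
d <- ymd("2016-07-08")
year(d) <- 2020                 # overwrite just the year
d

# Arithmetic with lubridate period helpers
today() + days(30)              # 30 days from today
today() - weeks(2)              # two weeks ago
now() + hours(6)                # six hours from now
ymd("2016-07-08") + years(1)    # the same date next year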

3.0.2.1 Getting Components

You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().

# create an object
mydatetime <- ymd_hms("2016-07-08 12:34:56")
mydatetime
## [1] "2016-07-08 12:34:56 UTC"
# extract the year
year(mydatetime)
## [1] 2016
# extract month
month(mydatetime)
## [1] 7
# extract day
day(mydatetime)
## [1] 8
# extract week day
wday(mydatetime)
## [1] 6

For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

# extract week day
wday(mydatetime , label = TRUE , abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
# extract month
month(mydatetime , label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

Let’s practice component extraction on a new dataset that contains transactions. The data set has 3 columns. All the dates are in the format (yyyy-mm-dd).

  • Invoice: invoice date

  • Due: due date

  • Payment: payment date

# load the transactions dataset
transactions <- read_csv("Data/transact.csv")
head(transactions)
## # A tibble: 6 × 3
##   Invoice    Due        Payment   
##   <date>     <date>     <date>    
## 1 2013-01-02 2013-02-01 2013-01-15
## 2 2013-01-26 2013-02-25 2013-03-03
## 3 2013-07-03 2013-08-02 2013-07-08
## 4 2013-02-10 2013-03-12 2013-03-17
## 5 2012-10-25 2012-11-24 2012-11-28
## 6 2012-01-27 2012-02-26 2012-02-22

Notice how read_csv has already converted the date columns to Date type

Let's add new columns with the day, month and year extracted from the Due column

# create new columns
transactions %>%
  mutate(
    due_day   = day(Due),
    due_month = month(Due),
    due_year  = year(Due)
  ) %>% head()
## # A tibble: 6 × 6
##   Invoice    Due        Payment    due_day due_month due_year
##   <date>     <date>     <date>       <int>     <dbl>    <dbl>
## 1 2013-01-02 2013-02-01 2013-01-15       1         2     2013
## 2 2013-01-26 2013-02-25 2013-03-03      25         2     2013
## 3 2013-07-03 2013-08-02 2013-07-08       2         8     2013
## 4 2013-02-10 2013-03-12 2013-03-17      12         3     2013
## 5 2012-10-25 2012-11-24 2012-11-28      24        11     2012
## 6 2012-01-27 2012-02-26 2012-02-22      26         2     2012

Let us estimate the number of days to settle the invoice by subtracting the date of invoice from the date of payment and then summarize it

# create the days_to_pay feature
invoice_due <- transactions %>%
  mutate(days_to_pay = as.numeric(Payment - Invoice))

# top 10 rows
head(invoice_due , n = 10)
## # A tibble: 10 × 4
##    Invoice    Due        Payment    days_to_pay
##    <date>     <date>     <date>           <dbl>
##  1 2013-01-02 2013-02-01 2013-01-15          13
##  2 2013-01-26 2013-02-25 2013-03-03          36
##  3 2013-07-03 2013-08-02 2013-07-08           5
##  4 2013-02-10 2013-03-12 2013-03-17          35
##  5 2012-10-25 2012-11-24 2012-11-28          34
##  6 2012-01-27 2012-02-26 2012-02-22          26
##  7 2013-08-13 2013-09-12 2013-09-09          27
##  8 2012-12-16 2013-01-15 2013-01-12          27
##  9 2012-05-14 2012-06-13 2012-07-01          48
## 10 2013-07-01 2013-07-31 2013-07-26          25
# summary statistics
summary(invoice_due$days_to_pay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   18.00   26.00   26.44   35.00   75.00

On average, invoices are settled in about 26 days, with a maximum of 75 days.

3.0.2.2 Exercise

The Normalized Difference Moisture Index (NDMI) detects moisture levels in vegetation using a combination of near-infrared (NIR) and short-wave infrared (SWIR) spectral bands. It is a reliable indicator of water stress in crops. Our dataset contains NDMI values recorded over the period May-1993 through Sep-2010. Let’s apply our component extraction skills to analyze this data

# Read the data
NDMI <- read_csv("Data/PlotNDMI.csv")
# Data characteristics
glimpse(NDMI)
## Rows: 752
## Columns: 2
## $ Date <chr> "01-Aug-02", "01-Aug-04", "01-Aug-07", "01-Aug-93", "01-Jul-01", …
## $ NDMI <dbl> -0.067809326, -0.207316202, -0.145061074, -0.016266738, -0.136528…

Notice that the Date is stored as a String. Let’s convert it to Date type applying the dmy() function from lubridate

# creating the date column
NDMI %>% mutate(Date_2 = dmy(Date)) %>% head()
## # A tibble: 6 × 3
##   Date         NDMI Date_2    
##   <chr>       <dbl> <date>    
## 1 01-Aug-02 -0.0678 2002-08-01
## 2 01-Aug-04 -0.207  2004-08-01
## 3 01-Aug-07 -0.145  2007-08-01
## 4 01-Aug-93 -0.0163 1993-08-01
## 5 01-Jul-01 -0.137  2001-07-01
## 6 01-Jul-05  0.0721 2005-07-01
# overwriting the data frame with new column
NDMI <- NDMI %>% mutate(Date_2 = dmy(Date))

head(NDMI)
## # A tibble: 6 × 3
##   Date         NDMI Date_2    
##   <chr>       <dbl> <date>    
## 1 01-Aug-02 -0.0678 2002-08-01
## 2 01-Aug-04 -0.207  2004-08-01
## 3 01-Aug-07 -0.145  2007-08-01
## 4 01-Aug-93 -0.0163 1993-08-01
## 5 01-Jul-01 -0.137  2001-07-01
## 6 01-Jul-05  0.0721 2005-07-01

Let's extract the Year and Month from Date_2 into new columns

# creating year and month of observation
NDMI <- NDMI %>% 
  mutate(Year = year(Date_2) , 
         Month = month(Date_2))

head(NDMI)
## # A tibble: 6 × 5
##   Date         NDMI Date_2      Year Month
##   <chr>       <dbl> <date>     <dbl> <dbl>
## 1 01-Aug-02 -0.0678 2002-08-01  2002     8
## 2 01-Aug-04 -0.207  2004-08-01  2004     8
## 3 01-Aug-07 -0.145  2007-08-01  2007     8
## 4 01-Aug-93 -0.0163 1993-08-01  1993     8
## 5 01-Jul-01 -0.137  2001-07-01  2001     7
## 6 01-Jul-05  0.0721 2005-07-01  2005     7

Let’s summarize the mean NDMI values by month and year applying the aggregate function

# mean NDMI by month and year
NDMI_AGG <- aggregate(NDMI~Month+Year,
                      data = NDMI,
                      FUN = mean)

# top 10 values
head(NDMI_AGG , n = 10)
##    Month Year        NDMI
## 1      5 1993 -0.09658576
## 2      6 1993 -0.04912763
## 3      7 1993  0.01793603
## 4      8 1993  0.04755371
## 5      9 1993  0.04672308
## 6      5 1994 -0.04652719
## 7      6 1994 -0.01155239
## 8      7 1994  0.06060317
## 9      8 1994 -0.01783854
## 10     9 1994  0.01663222

Look at the 1st observation: for the 5th month (May) of 1993, the NDMI values averaged about -0.0966

4 Creating and Visualizing Time Series

Time series data in R are stored in a time series object (class ts), which is an R data object like a vector or data frame. A time series object is created using the ts() function.

While you can have data containing dates and corresponding values in an R object of any other class such as a data frame, creating objects of ts class offers many benefits such as the time index information. Also, when you plot a ts object, it automatically creates a plot over time

Syntax : timeseries.object.name <- ts(data, start, end, frequency)

The function ts() takes the following main arguments:

  • data is set to everything in the dataframe / vector except for the date column; it isn’t needed since the ts object will store time information separately.

  • start indicates the start time of the first observation {1st available value of the 1st cycle}

  • frequency is set to indicate periodicity of the observations{# of observations per cycle}

The value of the frequency parameter in the ts() function specifies the number of observations per cycle. A value of 12 indicates monthly observations within a yearly cycle. Other values and their meanings are as below:

  • frequency = 12 pegs the data points for every month of a year.

  • frequency = 4 pegs the data points for every quarter of a year.

  • frequency = 6 pegs the data points for every 10 minutes of an hour.

  • frequency = 24*6 pegs the data points for every 10 minutes of a day.

  • frequency = 52 pegs the data points for every week for a 52 week period

If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25). Similarly, data that are observed every minute might have an hourly seasonality (frequency=60), a daily seasonality (frequency=24×60=1440), a weekly seasonality (frequency=24×60×7=10080) and an annual seasonality (frequency=24×60×365.25=525960)
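
When more than one seasonal period matters, the forecast package provides msts() to store several seasonal periods in a single object. The sketch below (on simulated daily values, with made-up numbers) is a minimal example of that idea.

# Hedged sketch: a daily series carrying both a weekly and an annual period
set.seed(1)
daily_values <- 100 + 10 * sin(2 * pi * (1:730) / 7) + rnorm(730)
daily_msts   <- msts(daily_values, seasonal.periods = c(7, 365.25))
class(daily_msts)      # inspect the class of the multi-seasonal object
autoplot(daily_msts)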

While specifying the start and frequency , care should be taken about the cycle and number of measurements per cycle.

4.0.1 Time-Series from scratch

Let us create a time series dataset of 50 uniformly distributed random observations between 10 and 45. Let us assume it is quarterly revenue starting from 1956.

# Create a vector of 50 uniformly distributed random observations
mydata = runif(n = 50, min = 10, max = 45)
# Convert Vector to Time-Series "ts" class
# Data starts in 1956 - 4 observations/year (quarterly)
mytimeseries = ts(data = mydata, 
                  start = 1956, 
                  frequency = 4)
# Plot the Time Series
plot(mytimeseries)

Let us check the class to validate that it is a ts object

# checking class
class(mytimeseries)
## [1] "ts"

Let us now check the “timestamp” for each data point

# checking timestamp
time(mytimeseries)
##         Qtr1    Qtr2    Qtr3    Qtr4
## 1956 1956.00 1956.25 1956.50 1956.75
## 1957 1957.00 1957.25 1957.50 1957.75
## 1958 1958.00 1958.25 1958.50 1958.75
## 1959 1959.00 1959.25 1959.50 1959.75
## 1960 1960.00 1960.25 1960.50 1960.75
## 1961 1961.00 1961.25 1961.50 1961.75
## 1962 1962.00 1962.25 1962.50 1962.75
## 1963 1963.00 1963.25 1963.50 1963.75
## 1964 1964.00 1964.25 1964.50 1964.75
## 1965 1965.00 1965.25 1965.50 1965.75
## 1966 1966.00 1966.25 1966.50 1966.75
## 1967 1967.00 1967.25 1967.50 1967.75
## 1968 1968.00 1968.25

We have 4 Quarters every year

4.0.2 Time-Series from Data-frame

Let us read in some time series data from an xlsx file using read_excel(), a function from the readxl package, and store the data as a ts object

# getting data
mydata = read_excel("Data/spend_data.xlsx")

Let us look at the Top 6 Rows of the data

# top 6 rows
head(mydata , n=6)
## # A tibble: 6 × 4
##   ...1   Sales AdBudget   GDP
##   <chr>  <dbl>    <dbl> <dbl>
## 1 Mar-81 1020.     659.  252.
## 2 Jun-81  889.     589   291.
## 3 Sep-81  795      512.  291.
## 4 Dec-81 1004.     614.  292.
## 5 Mar-82 1058.     647.  279.
## 6 Jun-82  944.     602   254

There are four observations in 1981, indicating quarterly data with a frequency of four rows per year.

The first observation or start date is Mar-81, the first of four rows for year 1981, indicating that March corresponds with the first period.

Let us convert it to a TS object by specifying start as 1981 and frequency as quarterly

# ts for class time series
# Data starts in 1981 - 4 observations/year (quarterly)
myts = ts(data = mydata[,2:4],         ## All columns starting from 2nd column
                  start = c(1981,1),   ## Start Year 1981 , Start Period 1st Quarter
                  frequency = 4)       ## Quarterly hence freq = 4
glimpse(myts)
##  Time-Series [1:100, 1:3] from 1981 to 2006: 1020 889 795 1004 1058 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "Sales" "AdBudget" "GDP"

We have now created a time series object whose index is the year and quarter

head(myts)
##          Sales AdBudget   GDP
## 1981 Q1 1020.2    659.2 251.8
## 1981 Q2  889.2    589.0 290.9
## 1981 Q3  795.0    512.5 290.8
## 1981 Q4 1003.9    614.1 292.4
## 1982 Q1 1057.7    647.2 279.1
## 1982 Q2  944.4    602.0 254.0

4.0.3 Visualizing Time Series

The first step in any data analysis task is to plot the data. Graphs enable you to visualize many features of the data, including patterns, unusual observations, changes over time, and relationships between variables. Just as the type of data determines which forecasting method to use, it also determines which graphs are appropriate.

You can use the autoplot() function to produce a time plot of the data with or without facets, or panels that display different subsets of data

example : autoplot(usnim_2002, facets = FALSE) will plot all the series of the object in a single panel

The ggplot2 package serves our purpose for creating stunning time series visualizations. The latest version of the forecast package integrates with ggplot2 to produce time series relevant plots.

A time series dataset calls for a line chart. Since this sort of chart implies a certain order along a timeline, the timeline is usually on the x axis, i.e. the observations are plotted against the time of observation, with consecutive observations joined by straight lines.

Let us plot the data you stored as myts using autoplot() with and without faceting.

# Plot data with faceting
autoplot(myts, facets = TRUE)

# Plot without facetting
autoplot(myts)

When we use faceting, each panel has its own y-axis scale for that individual time series.

Without faceting, the y-axis has a single common scale, so the individual time series can be compared directly.

Let us look at another dataset containing stock movements of 5 stocks

# get the data
stock_data <- read_csv("Data/5stocks.csv")

glimpse(stock_data)
## Rows: 4,001
## Columns: 6
## $ Date <date> 2001-07-05, 2001-07-06, 2001-07-09, 2001-07-10, 2001-07-11, 2001…
## $ AAPL <dbl> 1.66, 1.57, 1.62, 1.51, 1.61, 1.74, 1.78, 1.71, 1.79, 1.48, 1.43,…
## $ AMZN <dbl> 15.27, 15.27, 15.81, 15.61, 15.34, 16.49, 16.98, 16.01, 16.35, 15…
## $ IBM  <dbl> NA, 106.50, 104.72, 101.96, 103.85, 107.25, 108.53, 107.82, 108.5…
## $ WMT  <dbl> NA, 47.34, 48.25, 47.50, 48.85, 51.85, 52.90, 53.48, 53.35, 53.98…
## $ XOM  <dbl> NA, 43.40, 43.36, 42.88, 42.48, 42.72, 42.99, 43.00, 42.65, 42.12…
# top 6 rows
head(stock_data)
## # A tibble: 6 × 6
##   Date        AAPL  AMZN   IBM   WMT   XOM
##   <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001-07-05  1.66  15.3   NA   NA    NA  
## 2 2001-07-06  1.57  15.3  106.  47.3  43.4
## 3 2001-07-09  1.62  15.8  105.  48.2  43.4
## 4 2001-07-10  1.51  15.6  102.  47.5  42.9
## 5 2001-07-11  1.61  15.3  104.  48.8  42.5
## 6 2001-07-12  1.74  16.5  107.  51.8  42.7
# bottom 6 rows
tail(stock_data)
## # A tibble: 6 × 6
##   Date        AAPL  AMZN   IBM   WMT   XOM
##   <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-05-23  154.  972.  152.  78.5  82.6
## 2 2017-05-24  153.  980.  153.  78.2  82.3
## 3 2017-05-25  154.  993.  153.  78.3  81.8
## 4 2017-05-26  154.  996.  152.  78.1  81.6
## 5 2017-05-30  154.  997.  152.  78.2  81.1
## 6 2017-05-31  153.  995.  153.  78.6  80.5

Examining the observations in the Date column, we infer the following :

  1. Data is treated as monthly for this example – Frequency = Monthly (12)
  2. Starting from 7-2001 – Start = 2001 , 7
  3. Ending on 5-2017 – End = 2017 , 5

We have NA values in the 1st row, so let us omit that row

# omit null values
stock_data = na.omit(stock_data)

head(stock_data)
## # A tibble: 6 × 6
##   Date        AAPL  AMZN   IBM   WMT   XOM
##   <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001-07-06  1.57  15.3  106.  47.3  43.4
## 2 2001-07-09  1.62  15.8  105.  48.2  43.4
## 3 2001-07-10  1.51  15.6  102.  47.5  42.9
## 4 2001-07-11  1.61  15.3  104.  48.8  42.5
## 5 2001-07-12  1.74  16.5  107.  51.8  42.7
## 6 2001-07-13  1.78  17.0  109.  52.9  43.0

Let us select only the last 4 columns and convert the data to a time series object applying ts()

# Monthly time series (frequency = 12) starting in 2001
mystockts = ts(data = stock_data[,2:6], ## All columns starting from the 2nd
                  start = c(2001,6),    ## Start: year 2001, period 6
                  end = c(2017 , 5) ,   ## End: year 2017, period 5
                  frequency = 12)       ## Monthly hence frequency = 12

# Checking start , end and frequency
print(start(mystockts))
## [1] 2001    6
print(end(mystockts))
## [1] 2017    5
print(frequency(mystockts))
## [1] 12

Let us now plot the stock series, with and without facets

# plotting with facets
autoplot(mystockts , facets = TRUE)

# plotting without facets
autoplot(mystockts)

The graph above shows a lot of activity in IBM; however, its price level is far higher than that of the other stocks

Let's look at the weekly passenger load (in thousands) on Ansett Airlines between Melbourne and Sydney. This dataset is part of the fpp2 package (melsyd)

# glimpse the data
head(melsyd)
## Time Series:
## Start = c(1987, 26) 
## End = c(1987, 31) 
## Frequency = 52 
##          First.Class Business.Class Economy.Class
## 1987.481       1.912             NA        20.167
## 1987.500       1.848             NA        20.161
## 1987.519       1.856             NA        19.993
## 1987.538       2.142             NA        20.986
## 1987.558       2.118             NA        20.497
## 1987.577       2.048             NA        20.770

Let’s plot only the Economy Class Data

# Economy Class Passenger Load
autoplot(melsyd[,"Economy.Class"]) +
  ggtitle("Economy class passengers: Melbourne-Sydney") +
  xlab("Year") +
  ylab("Thousands")

  • There was a period in 1989 when no passengers were carried — this was due to an industrial dispute.

  • There was a period of reduced load in 1992. This was due to a trial in which some economy class seats were replaced by business class seats.

  • A large increase in passenger load occurred in the second half of 1991.

  • There are some large dips in load around the start of each year. These are due to holiday effects.

  • There is a long-term fluctuation in the level of the series which increases during 1987, decreases in 1989, and increases again through 1990 and 1991.

  • There are some periods of missing observations

Let's look at the Anti-diabetic drug sales data set that accompanies the fpp2 package (a10)

# Anti-diabetic drug sales
autoplot(a10) +
  ggtitle("Antidiabetic drug sales") +
  ylab("$ million") +
  xlab("Year")

  • There is a clear and increasing trend.

  • There is a strong seasonal pattern that increases in size as the level of the series increases.

  • There is a sudden drop at the start of each year possibly caused by a government subsidisation scheme that makes it cost-effective for patients to stockpile drugs at the end of the calendar year

4.0.4 Outliers and Frequency of Time Series

Let us learn two more functions: which.max() and frequency().

  • which.max() can be used to identify the smallest index of the maximum value.

  • frequency() returns the number of observations per unit time.

# Find the outlier in the gold series
goldoutlier <- which.max(gold)
goldoutlier
## [1] 770
# plot the gold series
autoplot(gold)

Observation #770 is the outlier, as seen from the time series plot

# frequency of gold
frequency(gold)
## [1] 1

A frequency of 1 means one observation per cycle, i.e. the series is treated as an annual (non-seasonal) time series

Let's look at the Australian monthly gas production dataset accompanying the fpp2 package (gas)

# plot the gas production
autoplot(gas)

# frequency
frequency(gas)
## [1] 12

This is a MONTHLY Time Series

4.0.5 Seasonal and Sub-Series Plot

Along with time plots, there are other useful ways of plotting data to emphasize seasonal patterns and show changes in these patterns over time.

  • A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. You can create one using the ggseasonplot() function the same way you do with autoplot().

  • An interesting variant of a season plot uses polar coordinates, where the time axis is circular rather than horizontal; to make one, simply add a polar argument and set it to TRUE.

  • A sub series plot comprises mini time plots for each season. Here, the mean for each season is shown as a blue horizontal line.

One way of splitting a time series is by using the window() function, which extracts a subset from the object x observed between the times start and end

example : window(x, start = NULL, end = NULL).

Let us visualize two data sets from the fpp2 package :

  • a10 contains monthly sales volumes for anti-diabetic drugs in Australia. In the plots, can you see which month has the highest sales volume each year? What is unusual about the results in March and April 2008?

  • ausbeer which contains quarterly beer production for Australia. What is happening to the beer production in Quarter 4?

# Create the basic plot of the a10 data
p2 = autoplot(a10)

# Create a seasonal plot of a10
p3 = ggseasonplot(a10 , year.labels=TRUE, year.labels.left=TRUE) +
  ylab("$ million") +
  ggtitle("Seasonal plot: antidiabetic drug sales")

# Produce a polar coordinate season plot for the a10 data
p1 = ggseasonplot(a10, polar = TRUE)

p1 | p2/p3

There is an increase in drug sales every year as indicated by the 2nd plot.

There is a large jump in sales in January each year; these are probably sales in late December, as customers stockpile before the end of the calendar year, but the sales are not registered with the government until a week or two later.

# Create the basic plot of the ausbeer data
autoplot(ausbeer)

Production has steadily increased. From about 1980 onwards, production oscillates around 500. Let us now restrict the ausbeer series to start at 1992 using the window() function.

# Restrict the start point to 1992
beer = window(ausbeer , start = 1992)
frequency(beer)
## [1] 4
# Make plots of the beer data
p1 = autoplot(beer)

# Let us look at "subseries" plots of beer
p2 = ggsubseriesplot(beer)

p1 + p2

Q1 : Mean Production ~ 450 {Blue Line} ; Q2 : Mean Production ~ 410 {Blue Line}

Q3 : Mean Production ~ 425 {Blue Line) ; Q4 : Mean Production ~ 525 {Blue Line}

4.0.6 Exercise

The arrivals data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.

  • Use autoplot(), ggseasonplot() and ggsubseriesplot() to compare the differences between the arrivals from these four countries.

  • Can you identify any unusual observations?

# Examine top 5 rows
head(arrivals)
##          Japan     NZ     UK     US
## 1981 Q1 14.763 49.140 45.266 32.316
## 1981 Q2  9.321 87.467 19.886 23.721
## 1981 Q3 10.166 85.841 24.839 24.533
## 1981 Q4 19.509 61.882 52.264 33.438
## 1982 Q1 17.117 42.045 53.636 33.527
## 1982 Q2 10.617 63.081 34.802 28.366
# Examine summary Statistics
summary(arrivals)
##      Japan               NZ               UK               US        
##  Min.   :  9.321   Min.   : 37.04   Min.   : 19.89   Min.   : 23.72  
##  1st Qu.: 74.135   1st Qu.: 96.52   1st Qu.: 53.89   1st Qu.: 63.95  
##  Median :135.461   Median :154.54   Median : 95.56   Median : 85.88  
##  Mean   :122.080   Mean   :170.59   Mean   :106.86   Mean   : 84.85  
##  3rd Qu.:176.752   3rd Qu.:228.60   3rd Qu.:128.13   3rd Qu.:108.98  
##  Max.   :227.641   Max.   :330.81   Max.   :269.29   Max.   :136.09
# Plot seasonal Time Series (with and without facets)
p1= autoplot(arrivals)
p2 = autoplot(arrivals , facets = TRUE) + geom_smooth() +
  labs("International arrivals to Australia",
       y = "Arrivals (in thousands)",
       x = NULL)
p1+p2

# Subseries plots by country
p1 = ggsubseriesplot(arrivals[,1]) + ggtitle("Japanese Arrivals")
p2 = ggsubseriesplot(arrivals[,2]) + ggtitle("New Zealand Arrivals")
p3 = ggsubseriesplot(arrivals[,3]) + ggtitle("UK Arrivals")
p4 = ggsubseriesplot(arrivals[,4]) + ggtitle("USA Arrivals")

p1 + p2 + p3 + p4

  • Japan : Arrivals into Australia increased until the late 90s and then began to drop around the turn of the millennium

  • New Zealand : Arrivals into Australia are increasing year on year, with the highest volume of passengers arriving since 2005.

  • UK & USA : Arrivals into Australia increased until about the year 2000; since then there seems to be stagnation

autoplot(arrivals, facets = TRUE) +
  geom_smooth() +
  labs("International arrivals to Australia",
       y = "Arrivals (in thousands)",
       x = NULL)

4.0.7 Time Series Components

The various reasons or forces which affect the values of an observation in a time series are the components of a time series. The four categories of time series components are trend, seasonal variations, cyclic variations, and random or irregular movements.

4.0.7.1 Trend

A trend in time series data refers to a long-term upward or downward movement in the data, indicating a general increase or decrease over time. It represents the underlying structure of the data, capturing the direction and magnitude of change over a longer period. Population, agricultural production, items manufactured, the number of births and deaths, and the number of schools or colleges are examples of series that typically show some kind of trend.

There are several types of trends in Time Series data :

  • Upward Trend: A trend that shows a general increase over time, where the values of the data tend to rise over time.

  • Downward Trend: A trend that shows a general decrease over time, where the values of the data tend to decrease over time.

  • Horizontal Trend: A trend that shows no significant change over time, where the values of the data remain constant over time.

  • Non-linear Trend: A trend that shows a more complex pattern of change over time, including upward or downward trends that change direction or magnitude over time.

  • Damped Trend: A trend that shows a gradual decline in the magnitude of change over time, where the rate of change slows down over time.

The time series data can have a combination of these types of trends or multiple trends present simultaneously. Accurately identifying and modeling the trend is a crucial step in time series analysis, as it can significantly impact the accuracy of forecasts and the interpretation of patterns in the data.

# Anti-diabetic drug sales
autoplot(a10) +
  geom_smooth()

It is clearly trended, with a change in the slope of the trend around 1990.

4.0.7.2 Seasonality

Seasonality in time series data refers to patterns that repeat over a regular time period, such as a day, a week, a month, or a year. These patterns arise due to regular events, such as holidays, weekends, or the changing of seasons, and can be present in various types of time series data, such as sales, weather, or stock prices.

There are several types of seasonality in time series data, including:

  • Weekly Seasonality: A type of seasonality that repeats over a 7-day period and is commonly seen in time series data such as sales, energy usage, or transportation patterns.

  • Monthly Seasonality: A type of seasonality that repeats over a 30- or 31-day period and is commonly seen in time series data such as sales or weather patterns.

  • Annual Seasonality: A type of seasonality that repeats over a 365- or 366-day period and is commonly seen in time series data such as sales, agriculture, or tourism patterns.

  • Holiday Seasonality: A type of seasonality that is caused by special events such as holidays, festivals, or sporting events and is commonly seen in time series data such as sales, traffic, or entertainment patterns.

It’s important to note that time series data can have multiple types of seasonality present simultaneously, and accurately identifying and modeling the seasonality is a crucial step in time series analysis.

These variations come into play either because of natural forces or man-made conventions. The various seasons or climatic conditions play an important role in seasonal variations. For example, the production of crops depends on the seasons, the sale of umbrellas and raincoats rises in the rainy season, and the sale of electric fans and air conditioners shoots up in the summer.

Let’s visualize the Quarterly Clay brick production dataset

# Quarterly Australian clay brick production
autoplot(bricksq) + geom_smooth()

Let's visualize the same dataset applying the seasonal plot method

# Quarterly clay brick production
ggseasonplot(bricksq, year.labels=FALSE, continuous=TRUE)

Here, we see that brick production has consistently increased over the years as the lower (darker) lines represent earlier years and the higher (lighter) lines represent recent years. Also, we see that brick production tends to be the lowest in Q1 and typically peaks in Q3 before leveling off or decreasing slightly in Q4

4.0.7.3 Cyclicity

Cyclicity in time series data refers to the repeated patterns or periodic fluctuations that occur in the data over a specific time interval. It can be due to various factors such as seasonality (daily, weekly, monthly, yearly), trends, and other underlying patterns.

Difference between Seasonality and Cyclicity

Seasonality refers to a repeating pattern in the data that occurs over a fixed time interval, such as daily, weekly, monthly, or yearly. Seasonality is a predictable and repeating pattern that can be due to various factors such as weather, holidays, and human behavior.

Cyclicity, on the other hand, refers to the repeated patterns or fluctuations that occur in the data over an unspecified time interval. These patterns can be due to various factors such as economic cycles, trends, and other underlying patterns. Cyclicity is not limited to a fixed time interval and can be of different frequencies, making it harder to identify and model.

For example, a business cycle might last 3 or 5 or 8 years between peaks or troughs. But a seasonal pattern is always of the same length

  • We need to distinguish between seasonal and cyclic patterns as very different time series models are used in each case.

  • Seasonal patterns have constant length, while cyclic patterns have variable length.

  • If both exist together, then the average length of the cyclic pattern is longer than the length of the seasonal pattern.

  • The size of the cycles tends to be more variable than the size of the seasonal fluctuations. As a result, it is much harder to predict cyclic data than seasonal data.

Let’s visualize the Annual Lynx trappings dataset. This is the number of lynx trapped annually in the Hudson Bay region of Canada from 1821 to 1934.

# Annual Canadian Lynx trappings 1821-1934
autoplot(lynx) + geom_smooth()

Because this is annual data, it cannot be seasonal. The population of lynx rises when there is plenty of food, and when the food supply gets low, they stop breeding causing the population to plummet. The surviving lynx then have plenty of food, start to breed again, and the cycle continues. The length of these cycles varies from between 8 and 11 years.

This is also a good example to show how variable the magnitude of cyclic patterns can be, with the largest peak being more than three times the size of the smallest peak

Let’s visualize the quarterly cement production dataset

# Quarterly cement production
autoplot(qcement) + geom_smooth()

We can see a few cycles in our cement data in the early '80s, '90s, '00s, and around 2008 - all of these date ranges coincide with economic downturns.

4.0.7.4 Exercise

Let's analyze the dengue cases dataset, which contains the number of dengue cases recorded in the Philippines.

# load the dataset
df <- read.csv("Data/denguecases.csv")

# top 5 rows
head(df)
##   Month Year   Region Dengue_Cases
## 1   Jan 2008 Region.I     2.953926
## 2   Feb 2008 Region.I     2.183336
## 3   Mar 2008 Region.I     0.972410
## 4   Apr 2008 Region.I     9.357156
## 5   May 2008 Region.I     7.320599
## 6   Jun 2008 Region.I     4.513452
# bottom 5 rows
tail(df)
##      Month Year Region Dengue_Cases
## 1831   Jul 2016    NCR     7.982075
## 1832   Aug 2016    NCR    10.126983
## 1833   Sep 2016    NCR     9.900541
## 1834   Oct 2016    NCR     8.460119
## 1835   Nov 2016    NCR     7.541772
## 1836   Dec 2016    NCR     7.831114

The data has been collected for all 12 months of each year, and the years range from 2008 through 2016

Let’s aggregate the data by Month and Year

#Aggregate regional data to represent the entire country
df_agg=aggregate(Dengue_Cases~Month+Year,df,sum)

Let’s now convert the Dengue-Cases data to a time series with a frequency of 12 (monthly) , starting at 2008-Jan and ending at 2016-Dec

# creating a time series object
dengue_ts = ts(df_agg$Dengue_Cases , 
                frequency = 12 , 
                start = c(2008,1) , 
                end = c(2016,12))
glimpse(dengue_ts)
##  Time-Series [1:108] from 2008 to 2017: 131.1 160 93.7 49.4 79.9 ...

Let’s plot the time series with a smoother line to visualize the trend

# visualizing dengue cases over time
autoplot(dengue_ts) +
  geom_smooth() +
  labs("Monthly Dengue cases in Philipines",
       y = "Number of Cases)",
       x = NULL)

The number of cases was more or less consistent from 2008 to 2010 (with perhaps a slight increase), followed by a definite increase from 2010 through roughly 2014-2016; since then, cases have been on the decline.

5 Time Series Smoothing & Decomposition

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category.

A time series can be considered composed of 4 main parts: trend, cycle, seasonality, and the irregular or remainder/residual part.

The Trend component is the longest-term behavior of a time series. The simplest model for a trend is a linear increase or decrease, but the trend need not be linear.

The trend component of the series is often considered along with the cyclic one (trend-cycle). The cyclical component is represented by fluctuations (rises and falls) not occurring at a fixed frequency. The cycle component is therefore different from the seasonal variation (see below) in that it does not follow a fixed calendar frequency

The Seasonal component is a repeated pattern occurring at a fixed time period such as the time of the year or the day of the week (the frequency of seasonality, which is always a fixed and known frequency)

The Irregular or remainder/residual component is the random-like part of the series. In general, when we fit mathematical models to time series data, the residual error series represents the discrepancies between the fitted values, calculated from the model, and the data

We usually combine the trend and cycle into a single trend-cycle component (sometimes called the trend for simplicity). Thus we think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series)

Decomposition methods try to identify and separate the above mentioned parts of a time series. Usually they consider together the trend and cycle (trend-cycle) - the longer-term changes in the series - and the seasonal factors - periodic fluctuations of constant length happening at a specific calendar frequency.

There are two main ways through which these elements can be combined together: in the additive and the multiplicative form

  • The additive model x_t = m_t + s_t + z_t (where x_t is the observed time series, m_t is the trend-cycle component, s_t is the seasonal component and z_t is the residual) is useful when the seasonal variation is relatively constant over time

  • The multiplicative model x_t = m_t * s_t * z_t is useful when the seasonal effect tends to increase as the trend increases.

The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.
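
As a hedged illustration (not part of the original text), base R's decompose() function performs a classical decomposition in either form; the built-in AirPassengers series, whose seasonal swings grow with the level, is a natural candidate for the multiplicative version.

# Classical decomposition of the built-in AirPassengers series
add_dec  <- decompose(AirPassengers, type = "additive")
mult_dec <- decompose(AirPassengers, type = "multiplicative")
plot(add_dec)    # trend, seasonal and random components, additive form
plot(mult_dec)   # the same components, multiplicative form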

5.1 Classical Decomposition

One of the main objectives for a decomposition is to estimate seasonal effects that can be used to create and present seasonally adjusted values.

A seasonally adjusted value removes the seasonal effect from a value so that trends can be seen more clearly. For instance, in many regions of the U.S. unemployment tends to decrease in the summer due to increased employment in agricultural areas. Thus a drop in the unemployment rate in June compared to May doesn’t necessarily indicate that there’s a trend toward lower unemployment in the country. To see whether there is a real trend, we should adjust for the fact that unemployment is always lower in June than in May.
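
As a small, hedged sketch of this idea (using the built-in AirPassengers series rather than unemployment data), the seasadj() function in the forecast package removes the estimated seasonal component from a decomposition to give a seasonally adjusted series.

# Seasonally adjusted series obtained from a classical decomposition
dec <- decompose(AirPassengers, type = "multiplicative")
autoplot(AirPassengers, series = "Original") +
  autolayer(seasadj(dec), series = "Seasonally adjusted")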

The classical method of time series decomposition originated in the 1920s and was widely used until the 1950s. It still forms the basis of many time series decomposition methods, so it is important to understand how it works.

Each data point can be replaced with the mean of that observation and a number of observations before and after it. This is called a centered moving average. A centered moving average is defined as

S_t = (Y_{t-q} + … + Y_t + … + Y_{t+q}) / (2q + 1)

where S_t is the smoothed value at time t and k = 2q + 1 is the number of observations that are averaged. The k value is usually chosen to be an odd number. By necessity, when using a centered moving average, you lose the (k - 1) / 2 observations at each end of the series.
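
As a hedged aside, the same centered moving average can be computed by hand with stats::filter() using equal weights 1/k; this is essentially what the ma() function used below does for you.

# Centered moving average of order k = 5 computed directly with stats::filter()
k <- 5
manual_ma <- stats::filter(Nile, rep(1 / k, k), sides = 2)
head(manual_ma, 10)   # note the (k - 1)/2 = 2 missing values at the start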

Several functions in R can provide a simple moving average, including SMA() in the TTR package, rollmean() in the zoo package, and ma() in the forecast package. Here, you’ll use the ma() function to smooth the Nile time series that comes with the base R installation.

The Nile time series contains the annual flow of the river Nile at Aswan from 1871 to 1970.

We will examine the time series with k = 3 , 7 and 15

# quick data review
glimpse(Nile)
##  Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
# Basic Time series
ylim <- c(min(Nile), max(Nile))
 
autoplot(Nile) +
  ggtitle("Raw time series") +
  scale_y_continuous(limits=ylim)

This appears to be a classical “Non-Seasonal Time Series”

# Plot the basic series
ylim <- c(min(Nile), max(Nile))
 
p1 = autoplot(Nile) +
  ggtitle("Raw time series") +
  scale_y_continuous(limits=ylim)
# Moving average k = 3
p2 = autoplot(ma(Nile, 3)) + 
  ggtitle("Simple Moving Averages (k=3)") +
  scale_y_continuous(limits=ylim)
# Moving average k = 7
p3 = autoplot(ma(Nile, 7)) +
  ggtitle("Simple Moving Averages (k=7)") +
  scale_y_continuous(limits=ylim)
# Moving average k = 15
p4 = autoplot(ma(Nile, 15)) +
  ggtitle("Simple Moving Averages (k=15)") +
  scale_y_continuous(limits=ylim)
p1 + p2 + p3 + p4

As k increases, the plot becomes increasingly smoothed. The challenge is to find the value of k that highlights the major patterns in the data without under- or over-smoothing. This is more art than science, and you'll probably want to try several values of k before settling on one. From the plots, there certainly appears to have been a drop in river flow between 1892 and 1900, with a small increasing trend between 1941 and 1961, but this could also have been random variation.

Let us look at another time series: the file http://robjhyndman.com/tsdldata/misc/kings.dat contains data on the age at death of successive kings of England, starting with William the Conqueror (original source: Hipel and McLeod, 1994).

# read the data
kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat",skip=3)
kings
##  [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48
## [26] 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
# convert to a time series
kingstimeseries <- ts(kings)
kingstimeseries
## Time Series:
## Start = 1 
## End = 42 
## Frequency = 1 
##  [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48
## [26] 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
# Basic time series
ylim <- c(min(kingstimeseries), max(kingstimeseries))
 
autoplot(kingstimeseries) +
  ggtitle("Raw time series") +
  scale_y_continuous(limits=ylim)

Let’s try to smooth the time series using a simple moving average of order 3, and plot the smoothed time series data

# Moving average k = 3
autoplot(ma(kingstimeseries , 3)) + 
  ggtitle("Simple Moving Averages (k=3)") +
  scale_y_continuous(limits=ylim)

There still appear to be quite a lot of random fluctuations in the series smoothed with a simple moving average of order 3. To estimate the trend component more accurately, we might want to smooth the data with a simple moving average of a higher order. Finding the right amount of smoothing takes a little trial and error. For example, we can try a simple moving average of order 8.

# Moving average k = 8
autoplot(ma(kingstimeseries , 8)) + 
  ggtitle("Simple Moving Averages (k=8)") +
  scale_y_continuous(limits=ylim)

The data smoothed with a simple moving average of order 8 gives a clearer picture of the trend component, and we can see that the age at death of the English kings seems to have decreased from about 55 years old to about 38 years old during the reign of the first 20 kings, and then increased after that to about 73 years old by the end of the reign of the 40th king in the time series.
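One detail worth noting (my addition, based on how forecast::ma() handles even orders): with order = 8 the window cannot be centered on a single observation, so ma() averages two adjacent 8-term means, giving a centered “2×8” moving average, when centre = TRUE (the default).

# with an even order, ma() applies a centered 2 x m moving average by default
head(ma(kingstimeseries, order = 8, centre = TRUE), 10)   # values at the ends are NA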

5.2 Seasonal Decomposition

For time-series data with a periodicity greater than 1 (that is, with a seasonal component), we’d want to go beyond a description of the overall trend. You can use seasonal decomposition to examine both seasonal and general trends.

Time-series data that have a seasonal aspect (such as monthly or quarterly data) can be decomposed into a trend component, a seasonal component, and an irregular component. The trend component captures changes in level over time. The seasonal component captures cyclical effects due to the time of year. The irregular (or error) component captures those influences not described by the trend and seasonal effects. The decomposition can be additive or multiplicative.

An example may make the difference between additive and multiplicative models clearer. Consider a time series that records the monthly sales of motorcycles over a 10-year period.

In a model with an additive seasonal effect, the number of motorcycles sold tends to increase by 500 in November and December (due to the Christmas rush) and decrease by 200 in January (when sales tend to slow down). The seasonal increase or decrease is independent of the current sales volume.

In a model with a multiplicative seasonal effect, motorcycle sales in November and December tend to increase by 20% and decrease in January by 10%. In the multiplicative case, the impact of the seasonal effect is proportional to the current sales volume, which isn’t the case in an additive model. In many instances, the multiplicative model is more realistic.
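To make the distinction concrete, here is a minimal sketch (with invented numbers, not real sales data) that simulates the motorcycle example both ways: in the additive series the November/December bump is always about 500 units, while in the multiplicative series it grows with the level.

# simulate 10 years of monthly sales with an additive vs a multiplicative seasonal effect
set.seed(1)
trend       <- seq(1000, 4000, length.out = 120)              # rising sales level
add_season  <- rep(c(-200, rep(0, 9), 500, 500), times = 10)  # Jan -200, Nov/Dec +500
mult_season <- rep(c(0.9,  rep(1, 9), 1.2, 1.2), times = 10)  # Jan -10%, Nov/Dec +20%
additive <- ts(trend + add_season + rnorm(120, sd = 50),
               frequency = 12, start = c(2000, 1))
multiplicative <- ts(trend * mult_season * rnorm(120, mean = 1, sd = 0.02),
                     frequency = 12, start = c(2000, 1))
# patchwork: constant-size swings (left) vs level-proportional swings (right)
autoplot(additive) + autoplot(multiplicative)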

A popular method for decomposing a time series into trend, seasonal, and irregular components is STL (seasonal and trend decomposition using loess). In R, this can be accomplished with the stl() function. The format is stl(ts, s.window=, t.window=), where:

  • ts is the time series to be decomposed.

  • t.window and s.window control how rapidly the trend-cycle and seasonal components can change; smaller values allow more rapid changes. Both should be odd numbers: t.window is the number of consecutive observations used when estimating the trend-cycle, and s.window is the number of consecutive years used in estimating each value of the seasonal component.

  • The user must specify s.window, as there is no default. Setting it to be infinite (in practice, s.window = "periodic") forces the seasonal component to be identical across years.

  • Specifying t.window is optional; a default value is used if it is omitted.

The stl() function can only handle additive models, but this isn’t a serious limitation.

Multiplicative models can be transformed into additive models using a log transformation:

log(Yt) = log(Trendt × Seasonalt × Irregulart)

= log(Trendt) + log(Seasonalt) + log(Irregulart)

After fitting the additive model to the log-transformed series, the results can be back-transformed to the original scale. Let’s look at an example.

The time series AirPassengers comes with a base R installation and describes the monthly totals (in thousands) of international airline passengers between 1949 and 1960

# Basic Time series
autoplot(AirPassengers) + 
  ggtitle("Raw time series")

From the graph, it appears that the variability (the height of the peaks and valleys) of the series increases with the level, suggesting a multiplicative model.

Let’s display the time series created by taking the log of each observation

# log transformation
lAirPassengers <- log(AirPassengers)
autoplot(lAirPassengers, ylab="log(AirPassengers)") + 
  ggtitle("Log transformed time series")

The variance has stabilized, and the logged series looks like an appropriate candidate for an additive decomposition

Let us decompose it with s.window = "period" (a periodic seasonal component; t.window is left at its default), save the result in an object called fit, and plot it.

# fit the decomposed model
fit <- stl(lAirPassengers, s.window="period")

# plot the decomposed series
autoplot(fit)

Plotting the results shows the observed series along with the seasonal, trend, and remainder (irregular) components from 1949 to 1960. Note that the seasonal component has been constrained to remain the same across each year (using the s.window="period" option).

The trend is monotonically increasing, and the seasonal effect suggests more passengers in the summer (perhaps during vacations).

The grey bars on the right are magnitude guides: each bar represents the same absolute magnitude, which is useful because the y-axes differ across the panels.

The object returned by the stl() function includes a component called time.series that contains the trend, seasonal, and irregular portion of each observation. In this case, fit$time.series is based on the logged time series.

# individual components
head(fit$time.series , n = 10)
##             seasonal    trend     remainder
## Jan 1949 -0.09164042 4.829389 -0.0192493585
## Feb 1949 -0.11402828 4.830368  0.0543447685
## Mar 1949  0.01586585 4.831348  0.0355884457
## Apr 1949 -0.01402759 4.833377  0.0404632511
## May 1949 -0.01502478 4.835406 -0.0245905300
## Jun 1949  0.10978976 4.838166 -0.0426814256
## Jul 1949  0.21640041 4.840927 -0.0601151688
## Aug 1949  0.20960587 4.843469 -0.0558624690
## Sep 1949  0.06747156 4.846011 -0.0008273977
## Oct 1949 -0.07024836 4.850883 -0.0015112948

exp(fit$time.series) converts the decomposition back to the original metric.

# back transform the decomposed series
head(exp(fit$time.series) , n = 15)
##           seasonal    trend remainder
## Jan 1949 0.9124332 125.1344 0.9809347
## Feb 1949 0.8922327 125.2571 1.0558486
## Mar 1949 1.0159924 125.3798 1.0362293
## Apr 1949 0.9860703 125.6345 1.0412930
## May 1949 0.9850875 125.8897 0.9757094
## Jun 1949 1.1160434 126.2377 0.9582166
## Jul 1949 1.2415994 126.5866 0.9416561
## Aug 1949 1.2331919 126.9088 0.9456692
## Sep 1949 1.0697998 127.2318 0.9991729
## Oct 1949 0.9321623 127.8533 0.9984898
## Nov 1949 0.8077298 128.4777 1.0021654
## Dec 1949 0.9042619 129.6173 1.0067574
## Jan 1950 0.9124332 130.7670 0.9638262
## Feb 1950 0.8922327 132.0634 1.0693259
## Mar 1950 1.0159924 133.3726 1.0405479

Examining the seasonal effects suggests that the number of passengers increased by 24% in July (a multiplier of 1.24) and decreased by 20% in November (a multiplier of .80).
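Those monthly multipliers can be pulled out directly (a minimal sketch using the fit object from above; because s.window="period" was used, the first twelve seasonal values repeat every year):

# monthly seasonal multipliers on the original scale
multipliers <- exp(fit$time.series[1:12, "seasonal"])
names(multipliers) <- month.abb
round(multipliers, 2)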

Let’s look at the plots by month and the seasonal plots

# month plot
ggmonthplot(AirPassengers)  +                         
  labs(title="Month plot: AirPassengers",             
       x="",                                          
       y="Passengers (thousands)")

# seasonal plot
p <- ggseasonplot(AirPassengers) + geom_point() +     
  labs(title="Seasonal plot: AirPassengers",          
       x="",                                          
       y="Passengers (thousands)")
direct.label(p)

The month plot displays the subseries for each month (all January values connected, all February values connected, and so on), along with the average of each subseries. From this graph, it appears that the trend is increasing for each month in a roughly uniform way. Additionally, the greatest number of passengers occurs in July and August.

The season plot displays the subseries by year. Again, you see a similar pattern, with increases in passengers each year and the same seasonal pattern. By default, the ggplot2 package would create a legend for the year variable. The directlabels package is used to place the year labels directly on the graph, next to each line in the time series.

6 Autocorrelation & White Noise

Another way to look at time series data is to plot each observation against another observation that occurred some time previously by using gglagplot(). For example, you could plot yt against yt−1. This is called a lag plot because you are plotting the time series against lags of itself.

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series

For example, values separated by a fixed interval (a lag) might have a strong positive or negative correlation. When these correlations are present, they indicate that past values influence the current value. Analysts use the autocorrelation and partial autocorrelation functions to understand the properties of time series data, fit appropriate models, and make forecasts.

The autocorrelation function (ACF) assesses the correlation between observations in a time series for a set of lags. The ACF for a time series y is given by Corr(yt, yt−k), k = 1, 2, …

Use the autocorrelation function (ACF) to identify which lags have significant correlations, understand the patterns and properties of the time series, and then use that information to model the time series data. From the ACF, you can assess the randomness and stationarity of a time series. You can also determine whether trends and seasonal patterns are present.

The ggAcf() function produces ACF plots, sometimes known as correlograms.

6.1 ACF - non-seasonal time series

Let us look at the annual oil production (millions of tonnes) of Saudi Arabia, 1965-2013, available as the oil series.

# Create an autoplot of the oil data
autoplot(oil)

Let us plot the relationship between yt and yt−k, k=1,…,9 using gglagplot()

# Create a lag plot of oil data
gglagplot(oil)

Let us plot the correlations between yt and yt−k, k=1,…,9 using ggAcf()

# Create an ACF plot of the oil data
ggAcf(oil)

We can extract the correlation coefficients for lags by specifying plot = FALSE

# Extract the correlation coefficients for various lags
ggAcf(oil , plot = FALSE)
## 
## Autocorrelations of series 'oil', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000  0.866  0.710  0.545  0.387  0.244  0.096 -0.007 -0.103 -0.150 -0.142 
##     11     12     13     14     15     16 
## -0.142 -0.081 -0.015  0.042  0.113  0.178
  • Looking at the lag plot, the strongest relationship is at lag 1.

  • Looking at the ACF plot, r1 is higher than for the other lags and the autocorrelations decay smoothly as the lag increases. This reflects the trend in the data; because there is no seasonal pattern, there are no spikes at seasonal lags. (A quick check of r1 appears after this list.)

  • The dashed blue lines indicate whether the correlations are significantly different from zero.
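As the check promised above (a minimal sketch of my own), the lag-1 coefficient can be reproduced from the definition of the sample autocorrelation, r1 = Σ(yt − ȳ)(yt+1 − ȳ) / Σ(yt − ȳ)^2:

# reproduce r1 for the oil series from the definition of sample autocorrelation
x    <- as.numeric(oil)
xbar <- mean(x)
n    <- length(x)
r1   <- sum((x[-n] - xbar) * (x[-1] - xbar)) / sum((x - xbar)^2)
round(r1, 3)   # should match the 0.866 reported by ggAcf()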

Let’s look at the quarterly beer production in Australia (in megalitres) from 1956:Q1 to 2010:Q2, and extract the data from 1992 onward.

# Create a plot of the quarterly beer production
beer2 <- window(ausbeer, start=1992)
autoplot(beer2)

Let’s plots the relations between lags using the lag plot

# lag plot
gglagplot(beer2)

Let’s extract the correlation coefficients across lags

# extract correlation coefficients
ggAcf(beer2 , plot = FALSE)
## 
## Autocorrelations of series 'beer2', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000 -0.102 -0.657 -0.060  0.869 -0.089 -0.635 -0.054  0.832 -0.108 -0.574 
##     11     12     13     14     15     16     17     18 
## -0.055  0.774 -0.080 -0.568 -0.066  0.706 -0.065 -0.528

Let’s plot the ACF

# ACF plot
ggAcf(beer2)

  • The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data.

  • The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2)

  • r4 is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart.

  • r2 is more negative than for the other lags because troughs tend to be two quarters behind peaks.

6.2 ACF - seasonal time series

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of a trended time series tends to have positive values that slowly decrease as the lag increases.

When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags.

When data are both trended and seasonal, you see a combination of these effects

Let’s investigate these phenomenon by plotting the annual sunspot series (which follows the solar cycle of approximately 10-11 years) in sunspot.year and the daily traffic to the Hyndsight blog (which follows a 7-day weekly pattern) in hyndsight

# Plot the annual sunspot numbers
autoplot(sunspot.year)

ggAcf(sunspot.year)

ggAcf(sunspot.year , plot = FALSE)
## 
## Autocorrelations of series 'sunspot.year', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000  0.814  0.447  0.043 -0.262 -0.408 -0.361 -0.158  0.141  0.436  0.607 
##     11     12     13     14     15     16     17     18     19     20     21 
##  0.604  0.435  0.168 -0.094 -0.281 -0.346 -0.298 -0.149  0.053  0.246  0.371 
##     22     23     24 
##  0.380  0.265  0.065

Here the maximum autocorrelation occurs at lag 1, with a secondary peak around lags 10-11 reflecting the roughly 10-11 year solar cycle.

# Plot the traffic on the Hyndsight blog
autoplot(hyndsight)

ggAcf(hyndsight)

ggAcf(hyndsight , plot = FALSE)
## 
## Autocorrelations of series 'hyndsight', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000  0.651  0.231  0.030  0.001  0.193  0.544  0.734  0.532  0.164 -0.016 
##     11     12     13     14     15     16     17     18     19     20     21 
## -0.018  0.141  0.468  0.660  0.460  0.116 -0.045 -0.034  0.125  0.456  0.631 
##     22     23     24     25 
##  0.421  0.094 -0.071 -0.070

Here the maximum autocorrelation occurs at lag 7 (and again at lag 14), reflecting the weekly seasonality of the daily data.

Let’s examine the monthly Australian electricity consumption time series aelec starting 1990

# subset the data to start from 1980
aelec <- window(elec, start=1980)

# plot the original time series
autoplot(aelec) + xlab("Year") + ylab("GWh")

Let’s decompose and plot the time-series

# decomposition
aelec %>%
  stl(t.window=13, s.window="periodic") %>%
  autoplot()

Let’s plot the ACF

ggAcf(aelec)

ggAcf(aelec , plot = FALSE)
## 
## Autocorrelations of series 'aelec', by lag
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## 1.000 0.936 0.888 0.823 0.776 0.758 0.726 0.746 0.748 0.767 0.805 0.820 0.845 
##    13    14    15    16    17    18    19    20    21    22    23    24 
## 0.787 0.741 0.678 0.634 0.618 0.590 0.606 0.611 0.626 0.659 0.670 0.692

The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality

6.3 White Noise

Time series that show no autocorrelation are called white noise. A white noise series is just a sequence of random, independent and identically distributed (iid) observations; statisticians often call this “iid data” for short. In other words, there is nothing going on: no trend, no seasonality, no cyclicity, not even any autocorrelation, just randomness. The name “white noise” comes from physics, where white light has similar mathematical characteristics.

Here is an example of a White Noise Time Series

set.seed(30)
y <- ts(rnorm(50))
autoplot(y) + ggtitle("White noise")

Let’s plot the autocorrelations

ggAcf(y)

The autocorrelation function of white noise consists of many insignificant spikes. Because the data is simply random, we expect correlations between observations to be close to zero.

The dashed blue lines show how large a spike has to be before we can consider it significantly different from zero; for a white noise series these bounds are approximately ±2/√T, where T is the length of the series.
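A quick check for the simulated series above (a minimal sketch of my own):

# approximate significance bounds for the simulated white noise series (T = 50)
T_len <- length(y)
c(lower = -2 / sqrt(T_len), upper = 2 / sqrt(T_len))   # roughly +/- 0.28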

Let’s examine the time series data pigs which has Monthly total number of pigs slaughtered in Victoria, Australia (Jan 1980 – Aug 1995)

# subset from 1990 onward, then plot
pigs <- window(pigs, start=1990)

autoplot(pigs/1000)+  
  xlab("Year")+  
  ylab("thousands")+  
  ggtitle("Monthly number of pigs slaughtered in Victoria")

At first glance it looks relatively random. Possibly there’s a slight upward trend, but it is hard to see the difference between this and a white noise series on the basis of a time plot.

Let’s plot a decomposed version of the original time series

# decomposition
pigs %>%
  stl(t.window=13, s.window="periodic") %>%
  autoplot()

Evidently there is seasonality and a more or less cyclic trend.

Let’s look at the ACF plot

# correlation plot
ggAcf(pigs)+  
  ggtitle("ACF of monthly pigs slaughtered in Victoria")

When we look at the ACF plot of the same data, we can see there is actually some information in it: the first three spikes are significantly larger than zero. So we can be confident that this is not a white noise series and that there is some information in the data that can be used in building a forecasting model.

6.4 Testing for Autocorrelation

Looking at an ACF plot is useful, but sometimes it is easier to test all the autocorrelations together rather than consider each one separately. To do this, we can use a Ljung-Box test. It considers the first h autocorrelation values and asks whether, as a group, they look like what you would expect from a white noise series.

The Ljung-Box test (Ljung is pronounced “Young”; the test is sometimes called the modified Box-Pierce test, or just the Box test) is a way to test for the absence of serial autocorrelation up to a specified lag h.

The test determines whether the errors are iid (i.e. white noise) or whether there is something more behind them; in other words, whether the autocorrelations of the errors or residuals are non-zero. Essentially, it is a test of lack of fit: if the autocorrelations of the residuals are very small, we say that the model doesn’t show ‘significant lack of fit’.

The Null Hypothesis H0 is that the residuals are independently distributed. The Alternative Hypothesis HA is that the residuals are not independently distributed and exhibit a serial correlation.

A small (significant) p-value leads us to reject the null hypothesis of independence and to conclude that the series exhibits autocorrelation.
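The statistic itself is Q = T(T + 2) × [r1^2/(T − 1) + r2^2/(T − 2) + … + rh^2/(T − h)], which is compared with a chi-squared distribution with h degrees of freedom (less the number of fitted parameters when testing model residuals). As a minimal sketch of my own, computing it by hand for the simulated white noise series y from section 6.3 agrees with Box.test():

# Ljung-Box statistic computed by hand for the white noise series y, with h = 10 lags
n <- length(y)
h <- 10
r <- acf(y, lag.max = h, plot = FALSE)$acf[-1]       # r_1, ..., r_h (drop lag 0)
Q <- n * (n + 2) * sum(r^2 / (n - seq_len(h)))
c(Q = Q, p.value = pchisq(Q, df = h, lower.tail = FALSE))
Box.test(y, lag = h, type = "Ljung")                 # should give the same Q and p-value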

Let’s test whether the pigs time series exhibits autocorrelation for the 1st 24 lags

# Testing for serial autocorrelation
Box.test(pigs, lag =24, fitdf =0, type ="Lj")
## 
##  Box-Ljung test
## 
## data:  pigs
## X-squared = 42.815, df = 24, p-value = 0.01044

Here the p-value is small (about 0.01), again suggesting that this is not a white noise series.

6.5 Stock prices and white noise

There is a well-known result in economics called the “Efficient Market Hypothesis” that states that asset prices reflect all available information. A consequence of this is that the daily changes in stock prices should behave like white noise (ignoring dividends, interest rates and transaction costs). The consequence for forecasters is that the best forecast of the future price is the current price.
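That “best forecast is the current price” statement is exactly the naïve method; as a quick sketch (not part of the original analysis), forecast::naive() implements it:

# naive forecasts: every future value equals the last observed price
fc <- naive(goog, h = 10)   # forecast the next 10 trading days
autoplot(fc)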

Let’s test this hypothesis by looking at the goog series, which contains the closing stock price for Google over 1000 trading days 25-Feb-2013 through 13-Feb-2017.

# Plot the original series
autoplot(goog) + 
  ggtitle("Closing stock prices of Google over 1000 days")

We are interested in daily changes in stock prices. Let’s apply the diff() function to calculate the daily change in prices and plot the result.

# Plot the differenced series
autoplot(diff(goog)) + 
  ggtitle("Daily changes in Google stock prices")

Now let’s plot the ACF to check if daily changes are “White Noise”

# ACF of daily changes
ggAcf(diff(goog))

# correlation coefficients
ggAcf(diff(goog) , plot = FALSE)
## 
## Autocorrelations of series 'diff(goog)', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000  0.039 -0.004 -0.065 -0.056 -0.027  0.014 -0.012 -0.053 -0.018 -0.009 
##     11     12     13     14     15     16     17     18     19     20     21 
##  0.003 -0.035  0.006  0.000 -0.028  0.029  0.059  0.006  0.015  0.054  0.035 
##     22     23     24     25     26     27     28     29 
## -0.004 -0.025 -0.015 -0.089 -0.030  0.001  0.046 -0.061

We do not observe any large autocorrelations. Apart from a small negative value at lag 25 (−0.089), which only just crosses the significance bound (with this many lags, roughly one in twenty spikes is expected to do so by chance), the coefficients are all close to zero. We can infer that the daily price changes are essentially uncorrelated.

Let’s apply the Ljung Box test to the daily changes with lag = 10

# Ljung-Box test of the differenced series
Box.test(diff(goog), lag = 10, fitdf = 0, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  diff(goog)
## X-squared = 13.123, df = 10, p-value = 0.2169

We see that the Ljung-Box test was not significant, so daily changes in the Google stock price look like white noise