# load libraries
library(tidyverse) # meta package for manipulating and visualizing data frames
library(lubridate) # package for manipulating dates
library(readxl) # reading excel files
library(forecast) # time series forecasting & linear models
library(fpp2) # data sets for "Forecasting: Principles and Practice"
library(patchwork) # composing multiple ggplot2 plots
library(ggplot2) # grammar-of-graphics plotting
library(xts) # extensible time series objects
library(directlabels) # labeling plots
Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.
The predictability of an event or a quantity depends on several factors including:
how well we understand the factors that contribute to it;
how much data is available;
whether the forecasts can affect the thing we are trying to forecast.
For example, forecasts of electricity demand can be highly accurate because all three conditions are usually satisfied. We have a good idea of the contributing factors: electricity demand is driven largely by temperatures, with smaller effects for calendar variation such as holidays, and economic conditions.
On the other hand, when forecasting currency exchange rates, only one of the conditions is satisfied: there is plenty of available data. However, we have a limited understanding of the factors that affect exchange rates, and forecasts of the exchange rate have a direct effect on the rates themselves. If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling.
Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.
Forecasting situations vary widely in their time horizons, factors determining actual outcomes, types of data patterns, and many other aspects. Forecasting methods can be simple, such as using the most recent observation as a forecast (which is called the naïve method), or highly complex, such as neural nets and econometric systems of simultaneous equations. Sometimes, there will be no data available at all. For example, we may wish to forecast the sales of a new product in its first year, but there are obviously no data to work with. In situations like this, we use judgmental forecasting. The choice of method depends on what data are available and the predictability of the quantity to be forecast.
Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning. However, business forecasting is often done poorly, and is frequently confused with planning and goals. They are three different things.
Forecasting is about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts.
Goals are what you would like to have happen. Goals should be linked to forecasts and plans, but this does not always occur. Too often, goals are set without any plan for how to achieve them, and no forecasts for whether they are realistic.
Planning is a response to forecasts and goals. Planning involves determining the appropriate actions that are required to make your forecasts match your goals.
Short-term forecasts are needed for the scheduling of personnel, production and transportation. As part of the scheduling process, forecasts of demand are often also required.
Medium-term forecasts are needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.
Long-term forecasts are used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.
Examples of time series data include:
Annual Google profits
Quarterly sales results for Amazon
Monthly rainfall
Weekly retail sales
Daily IBM stock prices
Hourly electricity demand
References :
Hyndman, R.J. & Athanasopoulos, G., Forecasting: Principles and Practice, 2nd edition, https://otexts.com/fpp2
Using R for Time Series Analysis, https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
Applied Time Series Analysis, Penn State STAT 510, https://online.stat.psu.edu/stat510/
Time Series Analysis With R, https://nicolarighetti.github.io/Time-Series-Analysis-With-R/
A univariate time series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals. One difference from standard linear regression is that the data are not necessarily independent and not necessarily identically distributed. One defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is very important because there is dependency, and changing the order could change the meaning of the data.
The basic objective usually is to determine a model that describes the pattern of the time series. Uses for such a model are:
To describe the important features of the time series pattern.
To explain how the past affects the future or how two time series can “interact”.
To forecast future values of the series.
To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations.
There are two basic types of “time domain” models.
Models that relate the present value of a series to past values and past prediction errors - these are called ARIMA models (for Autoregressive Integrated Moving Average).
Ordinary regression models that use time indices as x-variables. These can be helpful for an initial description of the data and form the basis of several simple forecasting methods.
Some important questions to consider when first looking at a time series are:
Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?
Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?
Are there outliers? In regression, outliers are far away from your line. With time series data, your outliers are far away from your other data.
Is there a long-run cycle or period unrelated to seasonality factors?
Is there constant variance over time, or is the variance non-constant?
Are there any abrupt changes to either the level of the series or the variance?
Predictor variables are often useful in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form
ED = f(current temperature, strength of economy, population, time of day, day of week, error).
The relationship is not exact — there will always be changes in electricity demand that cannot be accounted for by the predictor variables. The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model. We call this an explanatory model because it helps explain what causes the variation in electricity demand.
Because the electricity demand data form a time series, we could also use a time series model for forecasting. In this case, a suitable time series forecasting equation is of the form
ED_{t+1} = f(ED_t, ED_{t−1}, ED_{t−2}, ED_{t−3}, …, error), where t is the present hour, t+1 is the next hour, t−1 is the previous hour, and so on.
Here, prediction of the future is based on past values of a variable, but not on external variables which may affect the system. Again, the “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model.
Problem Definition : This step requires an understanding of the way the forecasts will be used, who requires the forecasts, and how the forecasting function fits within the organisation requiring the forecasts.
Gathering Information : Two kinds of information are required: (a) statistical data, and (b) the accumulated expertise of the people who collect the data and use the forecasts. Often, it will be difficult to obtain enough historical data to be able to fit a good statistical model. In that case, judgmental forecasting methods can be used. Occasionally, old data will be less useful due to structural changes in the system being forecast; then we may choose to use only the most recent data.
Preliminary Analysis (EDA) : Here we start by graphing the data and answering the following questions —
Are there consistent patterns?
Is there a significant trend?
Is seasonality important? Is there evidence of the presence of business cycles?
Are there any outliers in the data that need to be explained by those with expert knowledge?
How strong are the relationships among the variables available for analysis?
Choosing and Fitting Models : The model to use depends on the availability of historical data, the strength of relationships between the forecast variable and any explanatory variables, and the way in which the forecasts are to be used. These models include regression models, exponential smoothing methods, Box-Jenkins ARIMA models, and several advanced methods including neural networks and vector autoregression.
Using and Evaluating Forecasting Models : Once a model has been selected and its parameters estimated, the model is used to make forecasts. A number of methods have been developed to help in assessing the accuracy of forecasts, some of which include the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Akaike's Information Criterion (AIC), Bayesian Information Criterion (BIC), and so on.
If you have to manage dates and times in any analytics role, things can get confusing pretty fast. Time zones, formats, leap years and leap seconds all need to be considered. Depending on which type of task you want to perform with your date and time data, you have a lot of different tools available. We will take a look at standard tools in base R and then dive deeper into lubridate, which is probably the most convenient way to handle this sort of data.
Main use cases of lubridate related to Time Series include :
Date and Time based calculations
Advanced format conversions
Changing time zone
Checking for leap years
Creating a date-time object
There are three types of date/time data that refer to an instant in time:
A date. Tibbles print this as <date>.
A time within a day. Tibbles print this as <time>.
A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct.
To get the current date or date-time you can use today() or now().
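A minimal sketch of the calls that would produce the output below (the original chunk is not shown):
# current date and current date-time
today()
now()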
## [1] "2023-07-31"
## [1] "2023-07-31 19:15:51 IST"
There are three ways to create a date/time:
From a string.
From individual date-time components.
From an existing date/time object.
Date/time data often come as strings. One approach to parsing strings into date-times is to specify the format of each component explicitly.
Another approach is to use the helpers provided by lubridate that automatically work out the format once you specify the order of the component. To use them, we identify the order in which year, month, and day appear in our dates, then arrange “y”, “m”, and “d” in the same order.
Let's look at some examples of converting a date stored as a string into a date object by applying the lubridate helpers.
# Convert the string date to date object Year , Month , Day (helper = ymd)
my_date = ymd("2017-01-31")
print(my_date)
## [1] "2017-01-31"
# check the class (call assumed from the output below)
class(my_date)
## [1] "Date"
# Convert the string date to date object Month Day Year (helper = mdy)
my_date_2 = mdy("January 31st, 2017")
print(my_date_2)
## [1] "2017-01-31"
# check the class (call assumed from the output below)
class(my_date_2)
## [1] "Date"
# Convert the string date to date object Day Month Year (helper = dmy)
my_date_3 = dmy("31-Jan-2017")
print(my_date_3)
## [1] "2017-01-31"
# check the class (call assumed from the output below)
class(my_date_3)
## [1] "Date"
The lubridate helpers can also take "unquoted objects" and convert them to a date object.
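For example, an unquoted number works too; the call below is a sketch consistent with the output that follows:
# an unquoted number parsed as year-month-day
ymd(20170131)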
## [1] "2017-01-31"
Instead of a single string, sometimes we may have the individual components of the date-time spread across multiple columns. Let’s look at the flights data which is part of the nycflights13 package
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We have the following columns :
year, month, day : Date of departure split into year , month and day
dep_time, arr_time : Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time : Scheduled departure and arrival times (format HHMM or HMM), local tz.
hour, minute : Time of scheduled departure broken into hour and minutes
We will make a date-time object from this sort of input, using make_date() for dates, or make_datetime() for date-times.
Let's extract only the year, month, day, hour and minute columns and make the date-time object dep_date_time using the make_datetime() function from lubridate.
# create the departure time object
flights %>%
select(year, month, day, hour, minute) %>%
mutate(dep_date_time = make_datetime(year, month, day, hour, minute)) %>% head()
## # A tibble: 6 × 6
## year month day hour minute dep_date_time
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 05:58:00
Because we have constructed the departure time using hours and minutes , the object is of class POSIXct
Let’s create a departure date only column combining year , month and day
# create a departure day column
flights %>% select(year, month, day) %>%
mutate(dep_day = make_date(year, month, day)) %>% head()
## # A tibble: 6 × 4
## year month day dep_day
## <int> <int> <int> <date>
## 1 2013 1 1 2013-01-01
## 2 2013 1 1 2013-01-01
## 3 2013 1 1 2013-01-01
## 4 2013 1 1 2013-01-01
## 5 2013 1 1 2013-01-01
## 6 2013 1 1 2013-01-01
We may want to switch between a date-time and a date. Here we apply the as_datetime() and as_date() functions.
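A sketch of calls consistent with the output below (the unparseable string "bananas" is a made-up example; the original chunk is not shown):
# date-time from a date, date from a date-time
as_datetime(today())
as_date(now())
# a string that cannot be parsed becomes NA
ymd(c("2010-10-10", "bananas"))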
## [1] "2023-07-31 UTC"
## [1] "2023-07-31"
## [1] "2010-10-10" NA
The last element is not recognized as a date and is converted to a missing value (NA).
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
## [1] "January 1, 2010"
## [1] "2010-01-01"
## [1] "Date"
## [1] "2015-Mar-07"
## [1] "2015-03-07"
## [1] "Date"
## [1] "06-Jun-2017"
## [1] "2017-06-06"
## [1] "Date"
## [1] "August 19 (2015)" "July 1 (2015)"
## [1] "2015-08-19" "2015-07-01"
## [1] "Date"
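The output above corresponds to calls along the lines of the following sketch (the exact print statements are assumed):
# pick the helper that matches each string's component order
print(d1); mdy(d1); class(mdy(d1))
print(d2); ymd(d2); class(ymd(d2))
print(d3); dmy(d3); class(dmy(d3))
print(d4); mdy(d4); class(mdy(d4))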
Now that we know how to get date-time data into R’s date-time data structures, let’s explore what we can do with them.
Using accessor functions that let us get and set individual components.
Look at how arithmetic works with date-times.
You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().
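A sketch consistent with the output below (the example timestamp is taken from the first output line):
# build a date-time and pull out its components
datetime <- ymd_hms("2016-07-08 12:34:56")
datetime
year(datetime)
month(datetime)
mday(datetime)
wday(datetime)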
## [1] "2016-07-08 12:34:56 UTC"
## [1] 2016
## [1] 7
## [1] 8
## [1] 6
For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.
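A sketch of the labelled variants that would produce the output below:
# full weekday name and abbreviated month name
wday(datetime, label = TRUE, abbr = FALSE)
month(datetime, label = TRUE)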
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
Let’s practice component extraction on a new dataset that contains transactions. The data set has 3 columns. All the dates are in the format (yyyy-mm-dd).
Invoice: invoice date
Due: due date
Payment: payment date
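A sketch of how the data might be read in (the file name transactions.csv is hypothetical; the original chunk is not shown):
# read the transactions data; read_csv() parses yyyy-mm-dd columns as dates
transactions <- read_csv("transactions.csv")  # hypothetical file name
head(transactions)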
## # A tibble: 6 × 3
## Invoice Due Payment
## <date> <date> <date>
## 1 2013-01-02 2013-02-01 2013-01-15
## 2 2013-01-26 2013-02-25 2013-03-03
## 3 2013-07-03 2013-08-02 2013-07-08
## 4 2013-02-10 2013-03-12 2013-03-17
## 5 2012-10-25 2012-11-24 2012-11-28
## 6 2012-01-27 2012-02-26 2012-02-22
Notice how read_csv() has already converted the Invoice, Due and Payment columns to Date type.
Let's add new columns with the day, month and year extracted from the Due column.
# create new columns
transactions %>%
mutate(
due_day = day(Due),
due_month = month(Due),
due_year = year(Due)
) %>% head()
## # A tibble: 6 × 6
## Invoice Due Payment due_day due_month due_year
## <date> <date> <date> <int> <dbl> <dbl>
## 1 2013-01-02 2013-02-01 2013-01-15 1 2 2013
## 2 2013-01-26 2013-02-25 2013-03-03 25 2 2013
## 3 2013-07-03 2013-08-02 2013-07-08 2 8 2013
## 4 2013-02-10 2013-03-12 2013-03-17 12 3 2013
## 5 2012-10-25 2012-11-24 2012-11-28 24 11 2012
## 6 2012-01-27 2012-02-26 2012-02-22 26 2 2012
Let us estimate the number of days taken to settle the invoice by subtracting the invoice date from the payment date, and then summarize it.
# create the days_to_pay feature
invoice_due <- transactions %>%
mutate(days_to_pay = as.numeric(Payment - Invoice))
# top 10 rows
head(invoice_due, n = 10)
## # A tibble: 10 × 4
## Invoice Due Payment days_to_pay
## <date> <date> <date> <dbl>
## 1 2013-01-02 2013-02-01 2013-01-15 13
## 2 2013-01-26 2013-02-25 2013-03-03 36
## 3 2013-07-03 2013-08-02 2013-07-08 5
## 4 2013-02-10 2013-03-12 2013-03-17 35
## 5 2012-10-25 2012-11-24 2012-11-28 34
## 6 2012-01-27 2012-02-26 2012-02-22 26
## 7 2013-08-13 2013-09-12 2013-09-09 27
## 8 2012-12-16 2013-01-15 2013-01-12 27
## 9 2012-05-14 2012-06-13 2012-07-01 48
## 10 2013-07-01 2013-07-31 2013-07-26 25
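The summary below can be produced with a call like this (a sketch):
# distribution of the number of days taken to settle an invoice
summary(invoice_due$days_to_pay)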
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 18.00 26.00 26.44 35.00 75.00
On average, invoices are settled about 26 days after they are issued, with a maximum of 75 days.
The Normalized Difference Moisture Index (NDMI) detects moisture levels in vegetation using a combination of near-infrared (NIR) and short-wave infrared (SWIR) spectral bands. It is a reliable indicator of water stress in crops. Our dataset contains NDMI values recorded over the period May-1993 through Sep-2010. Let’s apply our component extraction skills to analyze this data
## Rows: 752
## Columns: 2
## $ Date <chr> "01-Aug-02", "01-Aug-04", "01-Aug-07", "01-Aug-93", "01-Jul-01", …
## $ NDMI <dbl> -0.067809326, -0.207316202, -0.145061074, -0.016266738, -0.136528…
Notice that the Date is stored as a string. Let's convert it to Date type by applying the dmy() function from lubridate, as sketched below.
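A minimal sketch of the conversion that yields the Date_2 column shown below:
# parse the day-month-year strings into a proper Date column
NDMI <- NDMI %>% mutate(Date_2 = dmy(Date))
head(NDMI)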
## # A tibble: 6 × 3
## Date NDMI Date_2
## <chr> <dbl> <date>
## 1 01-Aug-02 -0.0678 2002-08-01
## 2 01-Aug-04 -0.207 2004-08-01
## 3 01-Aug-07 -0.145 2007-08-01
## 4 01-Aug-93 -0.0163 1993-08-01
## 5 01-Jul-01 -0.137 2001-07-01
## 6 01-Jul-05 0.0721 2005-07-01
Let's extract the Year and Month from Date_2 into new columns.
# creating year and month of observation
NDMI <- NDMI %>%
mutate(Year = year(Date_2),
Month = month(Date_2))
head(NDMI)
## # A tibble: 6 × 5
## Date NDMI Date_2 Year Month
## <chr> <dbl> <date> <dbl> <dbl>
## 1 01-Aug-02 -0.0678 2002-08-01 2002 8
## 2 01-Aug-04 -0.207 2004-08-01 2004 8
## 3 01-Aug-07 -0.145 2007-08-01 2007 8
## 4 01-Aug-93 -0.0163 1993-08-01 1993 8
## 5 01-Jul-01 -0.137 2001-07-01 2001 7
## 6 01-Jul-05 0.0721 2005-07-01 2005 7
Let’s summarize the mean NDMI values by month and year applying the aggregate function
# mean NDMI by month and year
NDMI_AGG <- aggregate(NDMI~Month+Year,
data = NDMI,
FUN = mean)
# top 10 values
head(NDMI_AGG, n = 10)
## Month Year NDMI
## 1 5 1993 -0.09658576
## 2 6 1993 -0.04912763
## 3 7 1993 0.01793603
## 4 8 1993 0.04755371
## 5 9 1993 0.04672308
## 6 5 1994 -0.04652719
## 7 6 1994 -0.01155239
## 8 7 1994 0.06060317
## 9 8 1994 -0.01783854
## 10 9 1994 0.01663222
Look at the first observation: for month 5 (May) of 1993, the NDMI values averaged about -0.097.
The data for a time series is stored in an R object called a time-series object. It is an R data object, like a vector or data frame, and is created using the ts() function.
While you can hold dates and corresponding values in an R object of any other class, such as a data frame, creating an object of class ts offers benefits such as built-in time index information. Also, when you plot a ts object, it automatically creates a plot over time.
Syntax : timeseries.object.name <- ts(data, start, end, frequency)
The function ts() takes the following main arguments:
data is set to everything in the data frame / vector except for the date column; it isn't needed since the ts object will store time information separately.
start indicates the start time of the first observation {1st available value of the 1st cycle}.
frequency indicates the periodicity of the observations {# of observations per cycle}.
The value of the frequency parameter in the ts() function determines the time intervals at which the data points are measured. A value of 12 indicates that the time series is monthly over a yearly cycle. Other values and their meanings are as below:
frequency = 12 pegs the data points for every month of a year.
frequency = 4 pegs the data points for every quarter of a year.
frequency = 6 pegs the data points for every 10 minutes of an hour.
frequency = 24*6 pegs the data points for every 10 minutes of a day.
frequency = 52 pegs the data points for every week for a 52 week period
If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25). Similarly, data that are observed every minute might have an hourly seasonality (frequency=60), a daily seasonality (frequency=24×60=1440), a weekly seasonality (frequency=24×60×7=10080) and an annual seasonality (frequency=24×60×365.25=525960)
While specifying the start and frequency, care should be taken about the cycle and the number of measurements per cycle.
Let us create a time series dataset of 50 evenly distributed observations with a minimum of 10 and a maximum of 45. Let us assume it is quarterly revenue starting from 1956.
# Convert Vector to Time-Series "ts" class
# Data starts in 1956 - 4 observations/year (quarterly)
# mydata holds the 50 observations described above; one way to generate them (assumed):
mydata = runif(n = 50, min = 10, max = 45)
mytimeseries = ts(data = mydata,
start = 1956,
frequency = 4)
Let us check the class to validate whether it is a ts object or not.
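A sketch of the class check that produces the output below:
class(mytimeseries)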
## [1] "ts"
Let us now check the “timestamp” for each data point
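The time() function returns the time index of each observation (a sketch consistent with the output below):
time(mytimeseries)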
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 1956.00 1956.25 1956.50 1956.75
## 1957 1957.00 1957.25 1957.50 1957.75
## 1958 1958.00 1958.25 1958.50 1958.75
## 1959 1959.00 1959.25 1959.50 1959.75
## 1960 1960.00 1960.25 1960.50 1960.75
## 1961 1961.00 1961.25 1961.50 1961.75
## 1962 1962.00 1962.25 1962.50 1962.75
## 1963 1963.00 1963.25 1963.50 1963.75
## 1964 1964.00 1964.25 1964.50 1964.75
## 1965 1965.00 1965.25 1965.50 1965.75
## 1966 1966.00 1966.25 1966.50 1966.75
## 1967 1967.00 1967.25 1967.50 1967.75
## 1968 1968.00 1968.25
We have 4 Quarters every year
Let us read in some time series data from an xlsx file using read_excel(), a function from the readxl package, and store the data as a ts object.
Let us look at the Top 6 Rows of the data
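A sketch of the read step (the file name exercise1.xlsx is an assumption; the original chunk is not shown):
# read the quarterly sales data from Excel
mydata <- read_excel("exercise1.xlsx")
head(mydata)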
## # A tibble: 6 × 4
## ...1 Sales AdBudget GDP
## <chr> <dbl> <dbl> <dbl>
## 1 Mar-81 1020. 659. 252.
## 2 Jun-81 889. 589 291.
## 3 Sep-81 795 512. 291.
## 4 Dec-81 1004. 614. 292.
## 5 Mar-82 1058. 647. 279.
## 6 Jun-82 944. 602 254
There are four observations in 1981, indicating quarterly data (four rows per year).
The first observation, or start date, is Mar-81, the first of four rows for year 1981, indicating that March corresponds with the first period.
Let us convert it to a TS object by specifying start as 1981 and frequency as quarterly
# ts for class time series
# Data starts in 1981 - 4 observations/year (quarterly)
myts = ts(data = mydata[,2:4], ## All columns starting from 2nd column
start = c(1981,1), ## Start Year 1981 , Start Period 1st Quarter
frequency = 4) ## Quarterly hence freq = 4
glimpse(myts)
## Time-Series [1:100, 1:3] from 1981 to 2006: 1020 889 795 1004 1058 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:3] "Sales" "AdBudget" "GDP"
We have now created a time series object indexed by quarter and year.
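One way to peek at the first few quarters is to window the series (a sketch; the original call is not shown):
# first six quarters of the multivariate ts
window(myts, end = c(1982, 2))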
## Sales AdBudget GDP
## 1981 Q1 1020.2 659.2 251.8
## 1981 Q2 889.2 589.0 290.9
## 1981 Q3 795.0 512.5 290.8
## 1981 Q4 1003.9 614.1 292.4
## 1982 Q1 1057.7 647.2 279.1
## 1982 Q2 944.4 602.0 254.0
The first step in any data analysis task is to plot the data. Graphs enable you to visualize many features of the data, including patterns, unusual observations, changes over time, and relationships between variables. Just as the type of data determines which forecasting method to use, it also determines which graphs are appropriate.
You can use the autoplot() function to produce a time plot of the data with or without facets, i.e. panels that display different subsets of the data.
example : autoplot(usnim_2002, facets = FALSE) will plot the "trend lines" for every time series.
The ggplot2 package serves our purpose for creating stunning time series visualizations. The latest version of the forecast package integrates ggplot2 to produce plots relevant to time series.
A time series data set calls for a line chart. Since this sort of chart implies a certain order along a timeline, the timeline is usually on the x axis, i.e. the observations are plotted against the time of observation, with consecutive observations joined by straight lines.
Let us plot the data you stored as myts using autoplot() with and without faceting.
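A sketch of the two plots, combined with patchwork (the layout is an assumption):
# faceted vs non-faceted time plots of myts
p1 = autoplot(myts, facets = TRUE)
p2 = autoplot(myts, facets = FALSE)
p1 / p2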
When we use faceting, the y-axis is scaled for each individual time series.
Without faceting, the y-axis has a constant scale and the individual time series can be compared directly.
Let us look at another dataset containing stock movements of 5 stocks
## Rows: 4,001
## Columns: 6
## $ Date <date> 2001-07-05, 2001-07-06, 2001-07-09, 2001-07-10, 2001-07-11, 2001…
## $ AAPL <dbl> 1.66, 1.57, 1.62, 1.51, 1.61, 1.74, 1.78, 1.71, 1.79, 1.48, 1.43,…
## $ AMZN <dbl> 15.27, 15.27, 15.81, 15.61, 15.34, 16.49, 16.98, 16.01, 16.35, 15…
## $ IBM <dbl> NA, 106.50, 104.72, 101.96, 103.85, 107.25, 108.53, 107.82, 108.5…
## $ WMT <dbl> NA, 47.34, 48.25, 47.50, 48.85, 51.85, 52.90, 53.48, 53.35, 53.98…
## $ XOM <dbl> NA, 43.40, 43.36, 42.88, 42.48, 42.72, 42.99, 43.00, 42.65, 42.12…
## # A tibble: 6 × 6
## Date AAPL AMZN IBM WMT XOM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001-07-05 1.66 15.3 NA NA NA
## 2 2001-07-06 1.57 15.3 106. 47.3 43.4
## 3 2001-07-09 1.62 15.8 105. 48.2 43.4
## 4 2001-07-10 1.51 15.6 102. 47.5 42.9
## 5 2001-07-11 1.61 15.3 104. 48.8 42.5
## 6 2001-07-12 1.74 16.5 107. 51.8 42.7
## # A tibble: 6 × 6
## Date AAPL AMZN IBM WMT XOM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-05-23 154. 972. 152. 78.5 82.6
## 2 2017-05-24 153. 980. 153. 78.2 82.3
## 3 2017-05-25 154. 993. 153. 78.3 81.8
## 4 2017-05-26 154. 996. 152. 78.1 81.6
## 5 2017-05-30 154. 997. 152. 78.2 81.1
## 6 2017-05-31 153. 995. 153. 78.6 80.5
Examining the observations in the Date column, we infer the following:
We have missing (NA) values in the first row, so let us omit it, as sketched below.
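A sketch of the clean-up step that produces the output below:
# drop rows containing missing values (here, the first row)
stock_data <- na.omit(stock_data)
head(stock_data)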
## # A tibble: 6 × 6
## Date AAPL AMZN IBM WMT XOM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001-07-06 1.57 15.3 106. 47.3 43.4
## 2 2001-07-09 1.62 15.8 105. 48.2 43.4
## 3 2001-07-10 1.51 15.6 102. 47.5 42.9
## 4 2001-07-11 1.61 15.3 104. 48.8 42.5
## 5 2001-07-12 1.74 16.5 107. 51.8 42.7
## 6 2001-07-13 1.78 17.0 109. 52.9 43.0
Let us select the five stock columns (all columns except Date) and convert the data to a time series object applying ts()
# Daily stock prices coerced to a monthly ts (12 observations/year)
mystockts = ts(data = stock_data[,2:6], ## the five stock columns (2nd to 6th)
start = c(2001,6), ## start: year 2001, period 6 (the data actually begin in July 2001)
end = c(2017,5), ## end: year 2017, period 5
frequency = 12) ## Monthly hence frequency = 12
# Checking start , end and frequency
print(start(mystockts))
## [1] 2001 6
## [1] 2017 5
## [1] 12
Recall that the first row had missing values, which we omitted above.
The plot above shows a lot of activity in IBM; however, its price level is far higher than that of the other stocks.
Let's look at weekly economy passenger load on Ansett Airlines between Melbourne and Sydney. This dataset (melsyd) is part of the fpp2 package.
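A sketch of one way to peek at the first few weeks of the series (the original call is not shown):
# first six weeks of the melsyd data
window(melsyd, end = c(1987, 31))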
## Time Series:
## Start = c(1987, 26)
## End = c(1987, 31)
## Frequency = 52
## First.Class Business.Class Economy.Class
## 1987.481 1.912 NA 20.167
## 1987.500 1.848 NA 20.161
## 1987.519 1.856 NA 19.993
## 1987.538 2.142 NA 20.986
## 1987.558 2.118 NA 20.497
## 1987.577 2.048 NA 20.770
Let’s plot only the Economy Class Data
# Economy Class Passenger Load
autoplot(melsyd[,"Economy.Class"]) +
ggtitle("Economy class passengers: Melbourne-Sydney") +
xlab("Year") +
ylab("Thousands")
There was a period in 1989 when no passengers were carried — this was due to an industrial dispute.
There was a period of reduced load in 1992. This was due to a trial in which some economy class seats were replaced by business class seats.
A large increase in passenger load occurred in the second half of 1991.
There are some large dips in load around the start of each year. These are due to holiday effects.
There is a long-term fluctuation in the level of the series which increases during 1987, decreases in 1989, and increases again through 1990 and 1991.
There are some periods of missing observations
Let's look at the Anti-diabetic drug sales data set that accompanies the fpp2 package (a10).
# Anti-Diabetic Drug Sales
autoplot(a10) +
ggtitle("Antidiabetic drug sales") +
ylab("$ million") +
xlab("Year")
There is a clear and increasing trend.
There is a strong seasonal pattern that increases in size as the level of the series increases.
There is a sudden drop at the start of each year possibly caused by a government subsidisation scheme that makes it cost-effective for patients to stockpile drugs at the end of the calendar year
Let us learn two more functions, which.max() and frequency():
– which.max() can be used to identify the smallest index of the maximum value.
– frequency() gives the number of observations per unit time.
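A sketch assuming the series examined here is the gold series that ships with the forecast/fpp2 packages:
# index of the maximum value (the outlier)
which.max(gold)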
## [1] 770
Observation #770 is the outlier, as seen from the time series plot.
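Continuing the sketch with the same assumed series:
# number of observations per unit time
frequency(gold)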
## [1] 1
A frequency of 1 indicates one observation per unit of time, which is how an annual time series is represented.
Let's look at the Australian gas production dataset accompanying the fpp2 package (gas).
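A minimal sketch of the check that produces the output below:
frequency(gas)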
## [1] 12
This is a MONTHLY Time Series
Along with time plots, there are other useful ways of plotting data to emphasize seasonal patterns and show changes in these patterns over time.
A seasonal plot is similar to a time plot except that the data are plotted against the individual "seasons" in which the data were observed. You can create one using the ggseasonplot() function the same way you do with autoplot().
An interesting variant of a season plot uses polar coordinates, where the time axis is circular rather than horizontal; to make one, simply add a polar argument and set it to TRUE.
A subseries plot comprises mini time plots for each season. Here, the mean for each season is shown as a blue horizontal line.
One way of splitting a time series is by using the window() function, which extracts a subset from the object x observed between the times start and end.
example : window(x, start = NULL, end = NULL).
Let us visualize two data sets from the fpp2 package :
a10 contains monthly sales volumes for anti-diabetic drugs in Australia. In the plots, can you see which month has the highest sales volume each year? What is unusual about the results in March and April 2008?
ausbeer contains quarterly beer production for Australia. What is happening to the beer production in Quarter 4?
# Create the basic plot of the a10 data
p2 = autoplot(a10)
# Create a seasonal plot of a10
p3 = ggseasonplot(a10 , year.labels=TRUE, year.labels.left=TRUE) +
ylab("$ million") +
ggtitle("Seasonal plot: antidiabetic drug sales")
# Produce a polar coordinate season plot for the a10 data
p1 = ggseasonplot(a10, polar = TRUE)
p1 | p2/p3
There is an increase in drug sales every year, as indicated by the 2nd plot.
There is a large jump in sales in January each year; these are probably sales made in late December, as customers stockpile before the end of the calendar year, that are not registered with the government until a week or two later.
Production increased steadily until about 1980; since then it has oscillated around 450–500. Let us now restrict the ausbeer series to start at 1992 using the window() function.
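A sketch of the windowing step and the frequency check that produces the output below:
# keep only observations from 1992 onwards
beer <- window(ausbeer, start = 1992)
frequency(beer)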
## [1] 4
# Make plots of the beer data
p1 = autoplot(beer)
# Let us look at "subseries" plots of beer
p2 = ggsubseriesplot(beer)
p1 + p2
Q1 : Mean Production ~ 450 {Blue Line} ; Q2 : Mean Production ~ 410 {Blue Line}
Q3 : Mean Production ~ 425 {Blue Line) ; Q4 : Mean Production ~ 525 {Blue Line}
The arrivals data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.
Use autoplot(), ggseasonplot() and ggsubseriesplot() to compare the differences between the arrivals from these four countries.
Can you identify any unusual observations?
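A sketch of calls consistent with the output below (the windowing call used for the preview is an assumption):
# first six quarters and a numeric summary of each column
window(arrivals, end = c(1982, 2))
summary(arrivals)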
## Japan NZ UK US
## 1981 Q1 14.763 49.140 45.266 32.316
## 1981 Q2 9.321 87.467 19.886 23.721
## 1981 Q3 10.166 85.841 24.839 24.533
## 1981 Q4 19.509 61.882 52.264 33.438
## 1982 Q1 17.117 42.045 53.636 33.527
## 1982 Q2 10.617 63.081 34.802 28.366
## Japan NZ UK US
## Min. : 9.321 Min. : 37.04 Min. : 19.89 Min. : 23.72
## 1st Qu.: 74.135 1st Qu.: 96.52 1st Qu.: 53.89 1st Qu.: 63.95
## Median :135.461 Median :154.54 Median : 95.56 Median : 85.88
## Mean :122.080 Mean :170.59 Mean :106.86 Mean : 84.85
## 3rd Qu.:176.752 3rd Qu.:228.60 3rd Qu.:128.13 3rd Qu.:108.98
## Max. :227.641 Max. :330.81 Max. :269.29 Max. :136.09
# Plot seasonal Time Series (with and without facets)
p1 = autoplot(arrivals)
p2 = autoplot(arrivals , facets = TRUE) + geom_smooth() +
labs(title = "International arrivals to Australia",
y = "Arrivals (in thousands)",
x = NULL)
p1 + p2
# Subseries plots by country
p1 = ggsubseriesplot(arrivals[,1]) + ggtitle("Japanese Arrivals")
p2 = ggsubseriesplot(arrivals[,2]) + ggtitle("New Zealand Arrivals")
p3 = ggsubseriesplot(arrivals[,3]) + ggtitle("UK Arrivals")
p4 = ggsubseriesplot(arrivals[,4]) + ggtitle("USA Arrivals")
p1 + p2 + p3 + p4
Japan : Arrivals into Australia increased until the late 1990s and then began to drop at the turn of the millennium.
New Zealand : Arrivals into Australia are increasing year on year, with the highest volumes of passengers since 2005.
UK & USA : Arrivals into Australia increased until around the year 2000; from then on there appears to be a stagnation.
autoplot(arrivals, facets = TRUE) +
geom_smooth() +
labs(title = "International arrivals to Australia",
y = "Arrivals (in thousands)",
x = NULL)
The various reasons or forces which affect the values of the observations in a time series are called the components of a time series. The four categories of time series components are Trend, Seasonal Variations, Cyclic Variations, and Random or Irregular movements.
A trend in time series data refers to a long-term upward or downward movement in the data, indicating a general increase or decrease over time. It represents the underlying structure of the data, capturing the direction and magnitude of change over a longer period. The population, agricultural production, items manufactured, the number of births and deaths, and the number of industries, factories, schools or colleges are some examples of series that show some kind of trend.
There are several types of trends in Time Series data :
Upward Trend: A trend that shows a general increase over time, where the values of the data tend to rise over time.
Downward Trend: A trend that shows a general decrease over time, where the values of the data tend to decrease over time.
Horizontal Trend: A trend that shows no significant change over time, where the values of the data remain constant over time.
Non-linear Trend: A trend that shows a more complex pattern of change over time, including upward or downward trends that change direction or magnitude over time.
Damped Trend: A trend that shows a gradual decline in the magnitude of change over time, where the rate of change slows down over time.
The time series data can have a combination of these types of trends or multiple trends present simultaneously. Accurately identifying and modeling the trend is a crucial step in time series analysis, as it can significantly impact the accuracy of forecasts and the interpretation of patterns in the data.
It is clearly trended, with a change in the slope of the trend around 1990.
Seasonality in time series data refers to patterns that repeat over a regular time period, such as a day, a week, a month, or a year. These patterns arise due to regular events, such as holidays, weekends, or the changing of seasons, and can be present in various types of time series data, such as sales, weather, or stock prices.
There are several types of seasonality in time series data, including:
Weekly Seasonality: A type of seasonality that repeats over a 7-day period and is commonly seen in time series data such as sales, energy usage, or transportation patterns.
Monthly Seasonality: A type of seasonality that repeats over a 30- or 31-day period and is commonly seen in time series data such as sales or weather patterns.
Annual Seasonality: A type of seasonality that repeats over a 365- or 366-day period and is commonly seen in time series data such as sales, agriculture, or tourism patterns.
Holiday Seasonality: A type of seasonality that is caused by special events such as holidays, festivals, or sporting events and is commonly seen in time series data such as sales, traffic, or entertainment patterns.
It’s important to note that time series data can have multiple types of seasonality present simultaneously, and accurately identifying and modeling the seasonality is a crucial step in time series analysis.
These variations come into play either because of natural forces or man-made conventions. The various seasons or climatic conditions play an important role in seasonal variations: for example, the production of crops depends on the season, sales of umbrellas and raincoats rise in the rainy season, and sales of electric fans and air conditioners shoot up in summer.
Let’s visualize the Quarterly Clay brick production dataset
Let's visualize the same dataset by applying the seasonal plot method (see the sketch below).
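A sketch of the two plots, assuming the series is bricksq from the fpp2 package:
# time plot and seasonal plot of quarterly clay brick production
autoplot(bricksq)
ggseasonplot(bricksq, year.labels = TRUE)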
Here, we see that brick production has consistently increased over the years as the lower (darker) lines represent earlier years and the higher (lighter) lines represent recent years. Also, we see that brick production tends to be the lowest in Q1 and typically peaks in Q3 before leveling off or decreasing slightly in Q4
Cyclicity in time series data refers to the repeated patterns or periodic fluctuations that occur in the data over a specific time interval. It can be due to various factors such as seasonality (daily, weekly, monthly, yearly), trends, and other underlying patterns.
Difference between Seasonality and Cyclicity
Seasonality refers to a repeating pattern in the data that occurs over a fixed time interval, such as daily, weekly, monthly, or yearly. Seasonality is a predictable and repeating pattern that can be due to various factors such as weather, holidays, and human behavior.
Cyclicity, on the other hand, refers to the repeated patterns or fluctuations that occur in the data over an unspecified time interval. These patterns can be due to various factors such as economic cycles, trends, and other underlying patterns. Cyclicity is not limited to a fixed time interval and can be of different frequencies, making it harder to identify and model.
For example, a business cycle might last 3 or 5 or 8 years between peaks or troughs. But a seasonal pattern is always of the same length
We need to distinguish between seasonal and cyclic patterns as very different time series models are used in each case.
Seasonal patterns have constant length, while cyclic patterns have variable length.
If both exist together, then the average length of the cyclic pattern is longer than the length of the seasonal pattern.
The size of the cycles tends to be more variable than the size of the seasonal fluctuations. As a result, it is much harder to predict cyclic data than seasonal data.
Let’s visualize the Annual Lynx trappings dataset. This is the number of lynx trapped annually in the Hudson Bay region of Canada from 1821 to 1934.
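A minimal sketch of the plot, using the lynx series that ships with base R:
autoplot(lynx) + ggtitle("Annual Canadian lynx trappings")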
Because this is annual data, it cannot be seasonal. The population of lynx rises when there is plenty of food, and when the food supply gets low, they stop breeding, causing the population to plummet. The surviving lynx then have plenty of food, start to breed again, and the cycle continues. The length of these cycles varies between 8 and 11 years.
This is also a good example to show how variable the magnitude of cyclic patterns can be, with the largest peak being more than three times the size of the smallest peak
Let’s visualize the quarterly cement production dataset
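A minimal sketch of the plot, assuming the series is qcement from the fpp2 package:
autoplot(qcement) + ggtitle("Quarterly cement production")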
We can see a few cycles in our cement data in the early '80s, '90s, '00s, and around 2008; all of these date ranges coincide with economic downturns.
Let's analyze the dengue cases dataset, which has the number of dengue cases recorded in the Philippines.
## Month Year Region Dengue_Cases
## 1 Jan 2008 Region.I 2.953926
## 2 Feb 2008 Region.I 2.183336
## 3 Mar 2008 Region.I 0.972410
## 4 Apr 2008 Region.I 9.357156
## 5 May 2008 Region.I 7.320599
## 6 Jun 2008 Region.I 4.513452
## Month Year Region Dengue_Cases
## 1831 Jul 2016 NCR 7.982075
## 1832 Aug 2016 NCR 10.126983
## 1833 Sep 2016 NCR 9.900541
## 1834 Oct 2016 NCR 8.460119
## 1835 Nov 2016 NCR 7.541772
## 1836 Dec 2016 NCR 7.831114
The data has been collected for all 12 months of each year, and the years range from 2008 to 2016.
Let’s aggregate the data by Month and Year
# Aggregate regional data to represent the entire country
df_agg = aggregate(Dengue_Cases ~ Month + Year, df, sum)
Let's now convert the dengue cases data to a time series with a frequency of 12 (monthly), starting at 2008-Jan and ending at 2016-Dec.
# creating a time series object
dengue_ts = ts(df_agg$Dengue_Cases ,
frequency = 12 ,
start = c(2008,1) ,
end = c(2016,12))
glimpse(dengue_ts)
## Time-Series [1:108] from 2008 to 2017: 131.1 160 93.7 49.4 79.9 ...
Let’s plot the time series with a smoother line to visualize the trend
# visualizing dengue cases over time
autoplot(dengue_ts) +
geom_smooth() +
labs(title = "Monthly Dengue cases in Philippines",
y = "Number of Cases",
x = NULL)
The number of cases was more or less consistent from 2008 to 2010, with perhaps a slight increase, followed by a definite rise from 2010 through about 2014–2016; since then, cases have been on the decline.
Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category.
A time series can be considered composed of 4 main parts: trend, cycle, seasonality, and the irregular or remainder/residual part.
The Trend component is the longest-term behavior of a time series. The simplest model for a trend is a linear increase or decrease, but the trend need not be linear.
The trend component of the series is often considered along with the cyclic one (trend-cycle). The cyclical component is represented by fluctuations (rises and falls) not occurring at a fixed frequency. The cycle component is therefore different from the seasonal variation (see below) in that it does not follow a fixed calendar frequency
The Seasonal component is a repeated pattern occurring at a fixed time period such as the time of the year or the day of the week (the frequency of seasonality, which is always a fixed and known frequency)
The Irregular or remainder/residual component is the random-like part of the series. In general, when we fit mathematical models to time series data, the residual error series represents the discrepancies between the fitted values, calculated from the model, and the data.
We usually combine the trend and cycle into a single trend-cycle component (sometimes called the trend for simplicity). Thus we think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series)
Decomposition methods try to identify and separate the above mentioned parts of a time series. Usually they consider together the trend and cycle (trend-cycle) - the longer-term changes in the series - and the (seasonal factors) - periodic fluctuations of constant length happening at a specific calendar frequency.
There are two main ways through which these elements can be combined together: in the additive and the multiplicative form
The additive model (x_t = m_t + s_t + z_t, where x_t is the observed time series, m_t is the trend-cycle component, s_t is the seasonal component and z_t is the residual) is useful when the seasonal variation is relatively constant over time.
The multiplicative model (x_t = m_t * s_t * z_t) is useful when the seasonal effects tend to increase as the trend increases.
The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.
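As an illustration only (not part of the original analysis), classical decomposition via decompose() lets you request either form explicitly:
# additive vs multiplicative classical decomposition of AirPassengers
fit_add  <- decompose(AirPassengers, type = "additive")
fit_mult <- decompose(AirPassengers, type = "multiplicative")
autoplot(fit_mult)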
One of the main objectives for a decomposition is to estimate seasonal effects that can be used to create and present seasonally adjusted values.
A seasonally adjusted value removes the seasonal effect from a value so that trends can be seen more clearly. For instance, in many regions of the U.S. unemployment tends to decrease in the summer due to increased employment in agricultural areas. Thus a drop in the unemployment rate in June compared to May doesn’t necessarily indicate that there’s a trend toward lower unemployment in the country. To see whether there is a real trend, we should adjust for the fact that unemployment is always lower in June than in May.
The classical method of time series decomposition originated in the 1920s and was widely used until the 1950s. It still forms the basis of many time series decomposition methods, so it is important to understand how it works.
Each data point can be replaced with the mean of that observation and the q observations before and after it. This is called a centered moving average. A centered moving average is defined as
S_t = (Y_{t−q} + … + Y_t + … + Y_{t+q}) / (2q + 1)
where S_t is the smoothed value at time t and k = 2q + 1 is the number of observations that are averaged. The k value is usually chosen to be an odd number. By necessity, when using a centered moving average, you lose the (k − 1) / 2 observations at each end of the series.
Several functions in R can provide a simple moving average, including SMA() in the TTR package, rollmean() in the zoo package, and ma() in the forecast package. Here, you’ll use the ma() function to smooth the Nile time series that comes with the base R installation.
The Nile time series contains the annual flow of the river Nile at Aswan from 1871 to 1970.
We will examine the time series with k = 3 , 7 and 15
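A sketch of the structure check that produces the output below:
glimpse(Nile)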
## Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
# Basic Time series
ylim <- c(min(Nile), max(Nile))
autoplot(Nile) +
ggtitle("Raw time series") +
scale_y_continuous(limits=ylim)
This appears to be a classical "Non-Seasonal Time Series".
# Plot the basic series
ylim <- c(min(Nile), max(Nile))
p1 = autoplot(Nile) +
ggtitle("Raw time series") +
scale_y_continuous(limits=ylim)
# Moving average k = 3
p2 = autoplot(ma(Nile, 3)) +
ggtitle("Simple Moving Averages (k=3)") +
scale_y_continuous(limits=ylim)
# Moving average k = 7
p3 = autoplot(ma(Nile, 7)) +
ggtitle("Simple Moving Averages (k=7)") +
scale_y_continuous(limits=ylim)
# Moving average k = 15
p4 = autoplot(ma(Nile, 15)) +
ggtitle("Simple Moving Averages (k=15)") +
scale_y_continuous(limits=ylim)
# combine the four panels with patchwork (layout assumed; the original arrangement is not shown)
(p1 + p2) / (p3 + p4)
As k increases, the plot becomes increasingly smoothed. The challenge is to find the value of k that highlights the major patterns in the data without under- or over-smoothing. This is more art than science, and you'll probably want to try several values of k before settling on one. From the plots, there certainly appears to have been a drop in river flow between 1892 and 1900, with a small increasing trend between 1941 and 1961, but this could also have been random variation.
Let us look at another time series: the file http://robjhyndman.com/tsdldata/misc/kings.dat contains data on the age of death of successive kings of England, starting with William the Conqueror (original source: Hipel and McLeod, 1994).
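A sketch of the read-in step, following the approach in the "Little Book of R for Time Series" referenced above (the skip = 3 header lines are an assumption):
# read the ages at death and store them as an annual (frequency = 1) ts
kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip = 3)
kingstimeseries <- ts(kings)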
## [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48
## [26] 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
## Time Series:
## Start = 1
## End = 42
## Frequency = 1
## [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48
## [26] 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
# Basic time series
ylim <- c(min(kingstimeseries), max(kingstimeseries))
autoplot(kingstimeseries) +
ggtitle("Raw time series") +
scale_y_continuous(limits=ylim)
Let's try to smooth the time series using a simple moving average of order 3, and plot the smoothed time series data.
# Moving average k = 3
autoplot(ma(kingstimeseries , 3)) +
ggtitle("Simple Moving Averages (k=3)") +
scale_y_continuous(limits=ylim)
There still appears to be quite a lot of random fluctuation in the time series smoothed with a simple moving average of order 3. Thus, to estimate the trend component more accurately, we might want to try smoothing the data with a simple moving average of a higher order. This takes a little bit of trial and error, to find the right amount of smoothing. For example, we can try a simple moving average of order 8.
# Moving average k = 8
autoplot(ma(kingstimeseries , 8)) +
ggtitle("Simple Moving Averages (k=8)") +
scale_y_continuous(limits=ylim)
The data smoothed with a simple moving average of order 8 gives a clearer picture of the trend component, and we can see that the age of death of the English kings seems to have decreased from about 55 years old to about 38 years old during the reign of the first 20 kings, and then increased after that to about 73 years old by the end of the reign of the 40th king in the time series.
For time-series data with a periodicity greater than 1 (that is, with a seasonal component), we’d want to go beyond a description of the overall trend. You can use seasonal decomposition to examine both seasonal and general trends.
Time-series data that have a seasonal aspect (such as monthly or quarterly data) can be decomposed into a trend component, a seasonal component, and an irregular component. The trend component captures changes in level over time. The seasonal component captures cyclical effects due to the time of year. The irregular (or error) component captures those influences not described by the trend and seasonal effects. The decomposition can be additive or multiplicative.
An example may make the difference between additive and multiplicative models clearer. Consider a time series that records the monthly sales of motorcycles over a 10-year period.
In a model with an additive seasonal effect, the number of motorcycles sold tends to increase by 500 in November and December (due to the Christmas rush) and decrease by 200 in January (when sales tend to slow down). The seasonal increase or decrease is independent of the current sales volume.
In a model with a multiplicative seasonal effect, motorcycle sales in November and December tend to increase by 20% and decrease in January by 10%. In the multiplicative case, the impact of the seasonal effect is proportional to the current sales volume, which isn’t the case in an additive model. In many instances, the multiplicative model is more realistic.
A popular method for decomposing a time series into trend, seasonal, and irregular components is seasonal decomposition by loess smoothing. In R, this can be accomplished with the stl() function. The format is stl(ts, s.window=, t.window=)
ts is the time series to be decomposed.
t.window and s.window control how rapidly the trend-cycle and seasonal components can change. Smaller values allow for more rapid changes.
Both t.window and s.window should be odd numbers; t.window is the number of consecutive observations to be used when estimating the trend-cycle; s.window is the number of consecutive years to be used in estimating each value in the seasonal component.
The user must specify s.window as there is no default. Setting it to be infinite is equivalent to forcing the seasonal component to be periodic (i.e., identical across years).
Specifying t.window is optional, and a default value will be used if it is omitted.
The stl() function can only handle additive models, but this isn’t a serious limitation.
Multiplicative models can be transformed into additive models using a log transformation:
log(Y_t) = log(Trend_t × Seasonal_t × Irregular_t) = log(Trend_t) + log(Seasonal_t) + log(Irregular_t)
After fitting the additive model to the log-transformed series, the results can be back-transformed to the original scale. Let’s look at an example.
The time series AirPassengers comes with a base R installation and describes the monthly totals (in thousands) of international airline passengers between 1949 and 1960
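A minimal sketch of the time plot being described:
autoplot(AirPassengers) + ggtitle("Monthly international airline passengers")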
From the graph, it appears that variability (height of peaks and valleys) of the series increases with the level, suggesting a multiplicative model
Let’s display the time series created by taking the log of each observation
# log transformation
lAirPassengers <- log(AirPassengers)
autoplot(lAirPassengers, ylab="log(AirPassengers)") +
ggtitle("Log transformed time series")
The variance has stabilized, and the logged series looks like an appropriate candidate for an additive decomposition.
Let us decompose it with s.window = "periodic", save the result in an object called fit, and plot it.
# fit the decomposed model
fit <- stl(lAirPassengers, s.window="period")
# plot the decomposed series
autoplot(fit)
Plotting the results gives the graph above, which shows the time series, seasonal, trend, and irregular components from 1949 to 1960. Note that the seasonal components have been constrained to remain the same across each year (using the s.window="period" option).
The trend is monotonically increasing, and the seasonal effect suggests more passengers in the summer (perhaps during vacations).
The grey bars on the right are magnitude guides - each bar represents the same magnitude. This is useful because the y-axes are different for each graph
The object returned by the stl() function includes a component called time.series that contains the trend, season, and irregular portion of each observation. In this case, fit$time.series is based on the logged time series.
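A sketch of a call consistent with the output below (the exact windowing is an assumption):
# first ten months of the decomposed (logged) components
window(fit$time.series, end = c(1949, 10))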
## seasonal trend remainder
## Jan 1949 -0.09164042 4.829389 -0.0192493585
## Feb 1949 -0.11402828 4.830368 0.0543447685
## Mar 1949 0.01586585 4.831348 0.0355884457
## Apr 1949 -0.01402759 4.833377 0.0404632511
## May 1949 -0.01502478 4.835406 -0.0245905300
## Jun 1949 0.10978976 4.838166 -0.0426814256
## Jul 1949 0.21640041 4.840927 -0.0601151688
## Aug 1949 0.20960587 4.843469 -0.0558624690
## Sep 1949 0.06747156 4.846011 -0.0008273977
## Oct 1949 -0.07024836 4.850883 -0.0015112948
exp(fit$time.series) converts the decomposition back to the original metric.
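A sketch consistent with the output below (the exact windowing is an assumption):
# back-transformed components for the first fifteen months
window(exp(fit$time.series), end = c(1950, 3))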
## seasonal trend remainder
## Jan 1949 0.9124332 125.1344 0.9809347
## Feb 1949 0.8922327 125.2571 1.0558486
## Mar 1949 1.0159924 125.3798 1.0362293
## Apr 1949 0.9860703 125.6345 1.0412930
## May 1949 0.9850875 125.8897 0.9757094
## Jun 1949 1.1160434 126.2377 0.9582166
## Jul 1949 1.2415994 126.5866 0.9416561
## Aug 1949 1.2331919 126.9088 0.9456692
## Sep 1949 1.0697998 127.2318 0.9991729
## Oct 1949 0.9321623 127.8533 0.9984898
## Nov 1949 0.8077298 128.4777 1.0021654
## Dec 1949 0.9042619 129.6173 1.0067574
## Jan 1950 0.9124332 130.7670 0.9638262
## Feb 1950 0.8922327 132.0634 1.0693259
## Mar 1950 1.0159924 133.3726 1.0405479
Examining the seasonal effects suggests that passenger numbers run about 24% above the trend in July (a multiplier of 1.24) and about 20% below the trend in November (a multiplier of 0.80).
Let's look at the month plot and the seasonal plot.
# month plot
ggmonthplot(AirPassengers) +
labs(title="Month plot: AirPassengers",
x="",
y="Passengers (thousands)")
# seasonal plot
p = ggseasonplot(AirPassengers) + geom_point() +
labs(title="Seasonal plot: AirPassengers",
x="",
y="Passengers (thousands)")
direct.label(p)
The month plot displays the subseries for each month (all January values connected, all February values connected, and so on), along with the average of each subseries. From this graph, it appears that the trend is increasing for each month in a roughly uniform way. Additionally, the greatest number of passengers occurs in July and August.
The season plot displays the subseries by year. Again, you see a
similar pattern, with increases in passengers each year and the same
seasonal pattern. By default, the ggplot2 package would
create a legend for the year variable. The directlabels
package is used to place the year labels directly on the graph, next to
each line in the time series.
Another way to look at time series data is to plot each observation
against another observation that occurred some time previously by
using gglagplot(). For example, you could plot y_t against y_{t−1}. This is called a lag plot because you are plotting the time series
against lags of itself.
Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series
For example, values that are separated by an interval might have a strong positive or negative correlation. When these correlations are present, they indicate that past values influence the current value. Analysts use the autocorrelation and partial autocorrelation functions to understand the properties of time series data, fit the appropriate models, and make forecasts.
The autocorrelation function (ACF) assesses the correlation between observations in a time series for a set of lags. For a time series y, the ACF at lag k is r_k = Corr(y_t, y_{t−k}), k = 1, 2, …
Use the autocorrelation function (ACF) to identify which lags have significant correlations, understand the patterns and properties of the time series, and then use that information to model the time series data. From the ACF, you can assess the randomness and stationarity of a time series. You can also determine whether trends and seasonal patterns are present.
The ggAcf() function produces ACF plots, sometimes known as correlograms.
Let us look at the annual oil production (in millions of tonnes) for Saudi Arabia, 1965-2013, in the oil series.
Let us plot the relationship between y_t and y_{t−k}, for k = 1, …, 9, using gglagplot().
Let us plot the correlations between y_t and y_{t−k}, for k = 1, …, 9, using ggAcf().
We can extract the correlation coefficients by specifying plot = FALSE, as sketched below.
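A sketch of the calls behind these plots and the printed coefficients (the number of lags in the lag plot is my assumption):
# lag plot of y_t against its first nine lags
gglagplot(oil, lags = 9)
# correlogram
ggAcf(oil)
# print the autocorrelation coefficients instead of plotting them
ggAcf(oil, plot = FALSE)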
##
## Autocorrelations of series 'oil', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.866 0.710 0.545 0.387 0.244 0.096 -0.007 -0.103 -0.150 -0.142
## 11 12 13 14 15 16
## -0.142 -0.081 -0.015 0.042 0.113 0.178
Looking at the lag plot, the strongest correlation is exhibited at lag 1.
Looking at the ACF plot, r_1 is higher than for the other lags; because there is no seasonal pattern in the data, the correlations simply decay as the lag increases.
The dashed blue lines indicate whether the correlations are significantly different from zero
Let's look at the quarterly beer production in Australia (in megalitres) from 1956:Q1 to 2010:Q2, and extract the data from 1992 onward.
# Create a plot of the quarterly beer production
beer2 <- window(ausbeer, start=1992)
autoplot(beer2)
Let's plot the relationships between lags using a lag plot.
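A sketch of the lag plot call (the number of lags shown is my assumption):
# plot beer2 against its first nine lags
gglagplot(beer2, lags = 9)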
Let’s extract the correlation coefficients across lags
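As a sketch, mirroring the approach used for the oil series:
# print the autocorrelation coefficients for beer2
ggAcf(beer2, plot = FALSE)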
##
## Autocorrelations of series 'beer2', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 -0.102 -0.657 -0.060 0.869 -0.089 -0.635 -0.054 0.832 -0.108 -0.574
## 11 12 13 14 15 16 17 18
## -0.055 0.774 -0.080 -0.568 -0.066 0.706 -0.065 -0.528
Let’s plot the ACF
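A minimal sketch:
# correlogram of quarterly beer production from 1992
ggAcf(beer2)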
The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data.
The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2)
r_4 is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart.
r_2 is more negative than for the other lags because troughs tend to be two quarters behind peaks.
When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of trended time series tend to have positive values that slowly decrease as the lags increase.
When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags.
When data are both trended and seasonal, you see a combination of these effects
Let's investigate these phenomena by plotting the annual sunspot series (which follows the solar cycle of approximately 10-11 years) in sunspot.year and the daily traffic to the Hyndsight blog (which follows a 7-day weekly pattern) in hyndsight.
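A sketch of the calls behind the output below (using plot = FALSE to print the coefficients is my assumption):
# annual sunspot numbers: time plot and autocorrelations
autoplot(sunspot.year)
ggAcf(sunspot.year, plot = FALSE)
# daily Hyndsight traffic: time plot and autocorrelations
autoplot(hyndsight)
ggAcf(hyndsight, plot = FALSE)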
##
## Autocorrelations of series 'sunspot.year', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.814 0.447 0.043 -0.262 -0.408 -0.361 -0.158 0.141 0.436 0.607
## 11 12 13 14 15 16 17 18 19 20 21
## 0.604 0.435 0.168 -0.094 -0.281 -0.346 -0.298 -0.149 0.053 0.246 0.371
## 22 23 24
## 0.380 0.265 0.065
Here the largest autocorrelation occurs at lag 1, with a secondary peak around lags 10-11 reflecting the roughly 10-11 year solar cycle.
##
## Autocorrelations of series 'hyndsight', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.651 0.231 0.030 0.001 0.193 0.544 0.734 0.532 0.164 -0.016
## 11 12 13 14 15 16 17 18 19 20 21
## -0.018 0.141 0.468 0.660 0.460 0.116 -0.045 -0.034 0.125 0.456 0.631
## 22 23 24 25
## 0.421 0.094 -0.071 -0.070
Here the maximum autocorrelation occurs at lag 7, with further peaks at lags 14 and 21, reflecting the 7-day weekly pattern.
Let's examine the monthly Australian electricity consumption time series from 1980 onward, stored in aelec.
# subset the data to start from 1980
aelec <- window(elec, start=1980)
# plot the original time series
autoplot(aelec) + xlab("Year") + ylab("GWh")
Let's decompose and plot the time series.
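One way to do this, assuming an STL decomposition as used earlier (the s.window choice is my assumption):
# decompose the electricity series and plot the components
autoplot(stl(aelec, s.window = "periodic"))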
Let’s plot the ACF
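A sketch; the lag.max value is my assumption, chosen to match the 24 lags reported below:
# correlogram up to 24 lags
ggAcf(aelec, lag.max = 24)
# print the coefficients
ggAcf(aelec, lag.max = 24, plot = FALSE)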
##
## Autocorrelations of series 'aelec', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 1.000 0.936 0.888 0.823 0.776 0.758 0.726 0.746 0.748 0.767 0.805 0.820 0.845
## 13 14 15 16 17 18 19 20 21 22 23 24
## 0.787 0.741 0.678 0.634 0.618 0.590 0.606 0.611 0.626 0.659 0.670 0.692
The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality
Time series that show no autocorrelation are called White Noise. A white noise series is just random, independent and identically distributed (iid) observations; in shorthand, statisticians often call this "iid data". In other words, there is nothing going on: no trend, no seasonality, no cyclicity, not even any autocorrelation. Just randomness. The name "White Noise" comes from physics, where white light has similar mathematical characteristics.
Here is an example of a White Noise Time Series
Let’s plot the autocorrelations
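The original series is not reproduced here; a minimal sketch of simulating and inspecting white noise might look like this (the seed and series length are my assumptions):
# simulate 50 iid standard normal observations
set.seed(30)
wn <- ts(rnorm(50))
# time plot and correlogram
autoplot(wn) + ggtitle("White noise")
ggAcf(wn)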
The autocorrelation function of white noise consists of many insignificant spikes. Because the data is simply random, we expect correlations between observations to be close to zero.
The dashed blue lines show how large a spike must be before we can consider it significantly different from zero; for a white noise series, we expect about 95% of the spikes to lie within ±2/√T, where T is the length of the series.
Let's examine the pigs time series, which contains the monthly total number of pigs slaughtered in Victoria, Australia (Jan 1980 – Aug 1995).
# restrict the series to 1990 onward and plot
pigs <- window(pigs, start=1990)
autoplot(pigs/1000)+
xlab("Year")+
ylab("thousands")+
ggtitle("Monthly number of pigs slaughtered in Victoria")
At first glance it looks relatively random. Possibly there's a slight upward trend, but it is hard to see the difference between this and a white noise series on the basis of a time plot.
Let’s plot a decomposed version of the original time series
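A sketch, assuming an STL decomposition as before:
# decompose the (windowed) pigs series and plot the components
autoplot(stl(pigs, s.window = "periodic"))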
Evidently there is seasonality as well as a more or less cyclic trend.
Let’s look at the ACF plot
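A minimal sketch:
# correlogram of the pigs series
ggAcf(pigs)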
When we look at the ACF plot of the same data, we can see there is actually some information in it: the first three spikes are significantly larger than zero. So we can be confident that this is not a white noise series and that there is information in the data that can be used to build a forecasting model.
Looking at an ACF is useful, but sometimes it is easier to test all the autocorrelations together rather than consider each one separately. To do this, we can use a Ljung-Box test. It considers the first h autocorrelation values to see whether, as a group, they look like what you would expect from a white noise series.
The Ljung (pronounced Young) Box test (sometimes called the modified Box-Pierce, or just the Box test) is a way to test for the absence of serial autocorrelation, up to a specified lag k.
The test determines whether the errors are iid (i.e. white noise) or whether there is something more behind them; that is, whether the autocorrelations of the errors or residuals are nonzero. Essentially, it is a test of lack of fit: if the autocorrelations of the residuals are very small, we say that the model doesn't show 'significant lack of fit'.
The Null Hypothesis H0 is that the residuals are independently distributed. The Alternative Hypothesis HA is that the residuals are not independently distributed and exhibit a serial correlation.
A small p-value leads us to reject the null hypothesis of independence, indicating that the series is autocorrelated.
Let's test whether the pigs time series exhibits autocorrelation over the first 24 lags.
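A sketch of the call that would produce output like the following (the fitdf value is my assumption):
# Ljung-Box test on the first 24 autocorrelations
Box.test(pigs, lag = 24, fitdf = 0, type = "Ljung")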
##
## Box-Ljung test
##
## data: pigs
## X-squared = 42.815, df = 24, p-value = 0.01044
Here the p-value (0.01) is below 0.05, again suggesting that this is not a white noise series.
There is a well-known result in economics called the “Efficient Market Hypothesis” that states that asset prices reflect all available information. A consequence of this is that the daily changes in stock prices should behave like white noise (ignoring dividends, interest rates and transaction costs). The consequence for forecasters is that the best forecast of the future price is the current price.
Let’s test this hypothesis by looking at the goog
series, which contains the closing stock price for Google over 1000
trading days 25-Feb-2013 through 13-Feb-2017.
# Plot the original series
autoplot(goog) +
ggtitle("Closing stock prices of Google over 1000 days")
We are interested in daily changes in the stock price. Let's apply the diff() function to calculate the daily change in prices and plot the result.
# Plot the differenced series
autoplot(diff(goog)) +
ggtitle("Daily changes in Google stock prices")
Now let's plot the ACF to check if the daily changes are "White Noise".
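A sketch of the correlogram call (adding plot = FALSE to print the coefficients is my assumption):
# ACF of the daily changes
ggAcf(diff(goog))
# print the autocorrelation coefficients
ggAcf(diff(goog), plot = FALSE)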
##
## Autocorrelations of series 'diff(goog)', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.039 -0.004 -0.065 -0.056 -0.027 0.014 -0.012 -0.053 -0.018 -0.009
## 11 12 13 14 15 16 17 18 19 20 21
## 0.003 -0.035 0.006 0.000 -0.028 0.029 0.059 0.006 0.015 0.054 0.035
## 22 23 24 25 26 27 28 29
## -0.004 -0.025 -0.015 -0.089 -0.030 0.001 0.046 -0.061
We do not observe any large autocorrelations; the most noticeable is a small negative correlation at lag 25 (−0.089), which is of little practical importance. We can infer that the daily price changes are essentially uncorrelated.
Let's apply the Ljung-Box test to the daily changes with lag = 10.
# Ljung-Box test of the differenced series
Box.test(diff(goog), lag = 10, fitdf = 0, type = "Ljung")
##
## Box-Ljung test
##
## data: diff(goog)
## X-squared = 13.123, df = 10, p-value = 0.2169
We see that the Ljung-Box test was not significant, so daily changes in the Google stock price look like white noise