Working with Date and Time

There are different classes in which date and time can be formatted and used. The first among them is “POSIX” class. There are two formats of POSIX class. One is “POSIXlt” and the other is “POSIXct”. Let us examine those two. The formats specified in the code are standard formats.

x = as.POSIXlt("2017-08-09 13:45:34")
class(x)
## [1] "POSIXlt" "POSIXt"

Let us unclass it

unclass(x)
## $sec
## [1] 34
## 
## $min
## [1] 45
## 
## $hour
## [1] 13
## 
## $mday
## [1] 9
## 
## $mon
## [1] 7
## 
## $year
## [1] 117
## 
## $wday
## [1] 3
## 
## $yday
## [1] 220
## 
## $isdst
## [1] 0
## 
## $zone
## [1] "IST"
## 
## $gmtoff
## [1] NA

The unclassing of the given time shows the different parameters of the date and time that are stored. This is stored as a list. We can extract this using names. For example to get the timezone…

x$zone
## [1] "IST"

Now let us look at “POSIXct”.

x <- as.POSIXct("2017-09-01 11:12:13")
class(x)
## [1] "POSIXct" "POSIXt"

Unclassing it..

unclass(x)
## [1] 1504244533
## attr(,"tzone")
## [1] ""

The number that appeared is the number of seconds from 1st Jan 1970. It is the reference time for ‘POSIX’ class.

The other structure that is available is the “as.Date”. If we unclass it, we will get the number of days after reference date.

x <- as.Date("2017-09-09")
class(x)
## [1] "Date"
unclass(x)
## [1] 17418

The next function is from the “chron” library and the function is “chron()”. It has different format compared with POSIX class. The date should be formatted as “MM/DD/YYYY”. The advantage with chron is, it doesnot consider the timezone. So if the dataset we are working with belong to same timezone, chron() will make it easier for calculations. Similar to as.Date, it counts the number of days from reference date, but the extra hours are counted on decimals. Date and time are given as two different inputs for this function. Let us have a look at it.

library(chron)


x <- chron("12/25/17", "23:45:59")
class(x)
## [1] "chron" "dates" "times"
unclass(x)
## [1] 17525.99
## attr(,"format")
##   dates   times 
## "m/d/y" "h:m:s" 
## attr(,"origin")
## month   day  year 
##     1     1  1970

Lubridate Package

‘Lubridate’ package helps us to define time and date along with time zone. We can specify date in yyyymmdd, ddmmyyyy, mmddyyy format and convert them into time series object using ‘ymd’, ‘dmy’, ‘mdy’ respectively.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:chron':
## 
##     days, hours, minutes, seconds, years
## The following object is masked from 'package:base':
## 
##     date
date <- ymd(20171216)
date
## [1] "2017-12-16"
class(date)
## [1] "Date"
date <- dmy(16122017)
date
## [1] "2017-12-16"
class(date)
## [1] "Date"
date <- mdy(12162017)
date
## [1] "2017-12-16"
class(date)
## [1] "Date"

The above functions doesnot deal or apply timezones. We checked by looking at the object. So we will use ‘ymd_hm’ function to specify date time and timezone. We use ‘OlsonNames’ function to see the available timezones. It is alist of around 600 timezones, so I will list out first 10 of them.

head(OlsonNames(),10)
##  [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
##  [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"     
##  [7] "Africa/Bamako"      "Africa/Bangui"      "Africa/Banjul"     
## [10] "Africa/Bissau"

We will use Asia/Kolkata timezone for the example. We can extract hours, minutes, dates, year etc from the date. See the comments in the code.

time <- ymd_hms("2017-12-16 13:14:13", tz = "Asia/Kolkata")
time
## [1] "2017-12-16 13:14:13 IST"
# To extract minutes
minute(time)
## [1] 14
# To extract Hour
hour(time)
## [1] 13
# To extract Year
year(time)
## [1] 2017
# To extract Month
month(time)
## [1] 12
# To extract Date
day(time)
## [1] 16

Now let us look at how to extract weekday of the date.

wday(time)
## [1] 7

To make meaning out of this number, let us ask the package how it codes the days of week. We give abbreviation and label commands to the wday function.

wday(time, abbr = F, label = T)
## [1] Saturday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

We can compare times with different timezones also. For that we will use ‘with_tz’ function. It converts time of one timezone into another timezone.

with_tz(time, tz='Europe/Prague')
## [1] "2017-12-16 08:44:13 CET"

Some Calculations

#To Add Time
minutes(7)+seconds(10)
## [1] "7M 10S"
# To add an year to Time
time + years(1)
## [1] "2018-12-16 13:14:13 IST"
# To add 365 days = 1 duration year
time + dyears(5)
## [1] "2022-12-15 13:14:13 IST"
# To find leap years
leap_year(2000:2016)
##  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [12] FALSE  TRUE FALSE FALSE FALSE  TRUE

Let us calculate difference in time. Let us create two objects x and y using ‘ymd_hms’ function and calculate the difference between them.

x <- ymd_hms("2017-12-17 11:22:33")
y <- ymd_hms("2019-10-13 11:00:00")
y-x
## Time difference of 664.9843 days

Statistics for Time Series Data

Statistics for time series data are little different, because there are factors like seasonality, trend etc that effect the data. So the statistical tests also differ. Here we will look at some basics of those tests.

Most of the statistical tests for time series assume that the data is stationary. Stationary means that the mean and variance remains almost same over the time. So if the data is not stationary then we apply transformations to the data and make them close to stationary. So before applying any of the methods of forecasting in time series, we have to check if the data is stationary or not. One of those test is ‘Augmented Dickey Fuller Test’. This test is available in the ‘tseries’ package of R

library(tseries)
## 
## Attaching package: 'tseries'
## The following object is masked from 'package:chron':
## 
##     is.weekend

We will look at our ‘nottem’ data and apply the Dickey Fuller test on the nottem data.

head(nottem, 10)
##  [1] 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5
plot(nottem)

In this data, we can see that over time, mean and variance are almost same (atleast visually) and there is a seasonality component. So this is a stationary data. Let us confirm it with the test. We use ‘adf.test()’ function to do this test.

adf.test(nottem)
## Warning in adf.test(nottem): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  nottem
## Dickey-Fuller = -12.998, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

This function assumes that Null Hypothesis is Non Stationary data and the alternative hypothesis as Stationary data. In the above result, p-value is 0.01 (<0.05). So we reject the null hypothesis and accept the alternative hypothesis.

Autocorrelation

We use correlation to see how two variabes are linearly moving together or if if there is any relation between them. In time series, we use Autocorrelation. Autocorrelation is the correlation of data with data itself, but with some time lag. For example, relation of sales of september 2017, with sales of september 2016. So if the data is autocorrelated, it means the present data is dependent or related to the past data. We use ‘Durbin-Watson test’ of ‘lmtest’ package to confirm autocorrelation.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Let us apply ‘dwtest()’ for the nottem data. Let us find out, if there is autocorrelation with one step timelag. One Step = Y(t)-Y(t-1).

length(nottem)
## [1] 240

Nottem data has 240 observations over 20 years. We are going to check if there is correlation between the data and the immediately preceeding data. We shall chuck out the last observation for our Y variable and first observation for X variable. This will account for the lag and also have equal number of observations for both variables to perform the test.

dwtest(nottem[-240] ~ nottem[-1])
## 
##  Durbin-Watson test
## 
## data:  nottem[-240] ~ nottem[-1]
## DW = 1.0093, p-value = 5.097e-15
## alternative hypothesis: true autocorrelation is greater than 0

In the above result we see that p-value is far below 0.05. The test assumes Null Hypothesis as There is no autocorrelation (autocorrelation coefficient = 0) and alternative hypothesis as There is autocorrelation (autocorrelation coeficient >< 0). As p-value is less than 0.05, we reject the null hypothesis and accept the alternate hypothesis

But this test only tells if there is any correlation. It does not assign any quantitative value to it. For that we use AutoCorrelation Function (acf) and Partial AutoCorrelation Function (pacf). It checks the correlation with different lags, upto maximum lag specified. For example, if we specify that lag.max = 5, it checks the autocorrelations with one 1 step lag, 2 step lags, 3 steps lag, 4 steps lag and 5 steps lag and gives the correlation value. Let us apply this to our ‘nottem’ dataset.

acf(nottem, lag.max = 10, plot = F)
## 
## Autocorrelations of series 'nottem', by lag
## 
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 
##  1.000  0.808  0.452 -0.017 -0.464 -0.770 -0.876 -0.756 -0.445 -0.010 
## 0.8333 
##  0.429

‘pacf’ is used to get partial autocorrelation, which accounts for shorter time lags.

pacf(nottem, lag.max = 10, plot = F)
## 
## Partial autocorrelations of series 'nottem', by lag
## 
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 
##  0.808 -0.575 -0.553 -0.365 -0.215 -0.251 -0.223 -0.145  0.050  0.120

Predictions and Forecasting

The first step in doing forecasting or predictions for time series data is to convert the data into timeseries object. It is done using the ‘ts()’ function.

For example, let us create a random data and ‘x’ and convert it to time series object ‘x_time’. The parameters to be given for the ‘ts’ function are data, start time and number of observations for each time point (frequency).

x <- rnorm(50, 10, 6)
x_time <- ts(data = x, start = c(1956), frequency = 5)
plot(x_time)

Decomposing

A time series data has three components in it - Sesonality, Trend and White noise. Next step, to understand data, is to check the presence of these three components in the data.This helps us in deciding if there is any seasonality in the data or if there is any particular trend the data is following over time etc. We will use ‘decompose()’ function to do that. The decomposed dataset shows seasonality, trend and white noise as three different data. Adding (multiplying) these data gives the origina data in additive (multiplicative) models.

plot(nottem)

decomposed <- decompose(nottem, "additive")
names(decomposed)
## [1] "x"        "seasonal" "trend"    "random"   "figure"   "type"

‘x’ has the original data. ‘Seasonal’ has seasonal data. ‘Trend’ has the trend component. ‘type’ tells what kind of model (additive/multiplicative). Let us look at the plot of the new data.

plot(decomposed)

Smoothing

Involvement of white noise creates disturbances and spikes in the data. To get rid of that high variances and outliers, we usually smoothen time series data. Most simple smoothing technique is ‘Simple Moving Average (SMA)’ of ‘TTR’ library. Simple moving average is the consecutive ‘n’ data points from that point of time. SMA can be calculated using ‘SMA()’ function. Let us calculate and plot the SMA of ‘nottem’ data.

library(TTR)
plot(nottem)

nottem_smooth <- SMA(nottem, n=4)
plot(nottem_smooth)

ARIMA Models

ARIMA stands for Auto Regressive Integrated Moving Average. It is like liner regression for the time series data, where the dependent variable is the current time series data and the independent variable is the time lagged data. As with the linear regression, ARIMA assumes that the error terms are independent and identically distributed, have normal distribution with mean zero. It requires the data to be stationary. So if the data is not stationary, differencing should be done.

The ARIMA Model uses three other parameters along with the data. They are 1. p = Order of auto-regression 2. d = Order of differencing 3. q = Order of Moving Average

Or We can use ‘auto.arima()’ function of ‘forecast’ library to let R choose the best parameters. For R to choose best parameters, we have to provide the criterea for selection. The criteron available are ‘aic’, ‘aicc’, ‘bic’. Noe let us use ‘auto.arima()’ function for our lynx dataset.

library(forecast)
head(lynx)
## Time Series:
## Start = 1821 
## End = 1826 
## Frequency = 1 
## [1]  269  321  585  871 1475 2821
mod <- auto.arima(lynx, ic = 'aic')
mod
## Series: lynx 
## ARIMA(2,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1      ma2       mean
##       1.3421  -0.6738  -0.2027  -0.2564  1544.4039
## s.e.  0.0984   0.0801   0.1261   0.1097   131.9242
## 
## sigma^2 estimated as 761965:  log likelihood=-932.08
## AIC=1876.17   AICc=1876.95   BIC=1892.58

This result shows that ARIMA(2,0,2) is the best model for the data. To understand the information, look at the ARIMA Equation for (2,0,2)

Y(t) = Intercept + ar1 x Y(t-1) + ar2 x Y(t-1) + ma1 x Y(e-1) + ma2 x Y(e-2)

Forecasting

Forecasting can be done using ‘forecast()’ function from ‘forecast’ library. It has two parameters of importance. The ARIMA model and the future time steps for which we want to forecast. for example if we want to forecast for next 3 times of our ‘lynx’ dataset, we specify ‘h=3’

forecast(mod, h=3)
##      Point Forecast     Lo 80    Hi 80      Lo 95    Hi 95
## 1935       2989.910 1871.2360 4108.585  1279.0456 4700.775
## 1936       2093.478  397.6326 3789.323  -500.0938 4687.050
## 1937       1307.309 -516.0692 3130.687 -1481.3072 4095.925