In this workshop we illustrate how to use R for elementary time series analysis. In particular, we use a hotel occupancy example to illustrate seasonality. We also use R to build interactive time series visualisations with the dygraphs package. We first load the required libraries and set the working directory, then check the data types of all the columns.
library("pacman")
## Warning: package 'pacman' was built under R version 3.4.4
pacman::p_load(tidyverse, lubridate, zoo, forecast, fUnitRoots, dygraphs)
The example is hotel occupancy, which tends to follow a monthly seasonal pattern. Change the working directory to match your own setup.
setwd("C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102/6. Day_6/Data")
data = read.csv('Hoteldata.csv')
sapply(data, class)
## t Yt t.1 S1 S2 S3 S4
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## S5 S6 S7 S8 S9 S10 S11
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## Yt_trans
## "numeric"
First, take a look at the structure of ‘data’ and its initial few entries. Note that Yt_trans is the occupancy Yt transformed with a natural logarithm. The columns S1, S2, … are the monthly seasonal dummies discussed in class.
str(data)
## 'data.frame': 168 obs. of 15 variables:
## $ t : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Yt : int 501 488 504 578 545 632 728 725 585 542 ...
## $ t.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ S1 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ S2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ S3 : int 0 0 1 0 0 0 0 0 0 0 ...
## $ S4 : int 0 0 0 1 0 0 0 0 0 0 ...
## $ S5 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ S6 : int 0 0 0 0 0 1 0 0 0 0 ...
## $ S7 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ S8 : int 0 0 0 0 0 0 0 1 0 0 ...
## $ S9 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ S10 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ S11 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Yt_trans: num 6.22 6.19 6.22 6.36 6.3 6.45 6.59 6.59 6.37 6.3 ...
head(data, 3)
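As an aside, if the dummies were not already provided in the file, here is a minimal sketch of how they could be generated (assuming the 168 monthly observations start in month 1; month_index and dummies are illustrative names):
# Hypothetical illustration: build 12 monthly 0/1 dummies from a repeating month index
month_index = factor(rep(1:12, length.out = nrow(data)))
dummies = model.matrix(~ month_index - 1)
head(dummies, 3)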
Plot the data and observe the time series. Does it appear to have a trend and a seasonal pattern?
plot(data$t, data$Yt, type = "l")
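To see the seasonal pattern more directly, one option is to wrap Yt in a monthly ts object and look at its month-by-month sub-series; a sketch, assuming the data are monthly (the actual start year is not given, so a placeholder is used):
# Assumed monthly frequency; the start year is only a placeholder
Yt_ts = ts(data$Yt, frequency = 12, start = c(1, 1))
monthplot(Yt_ts, ylab = "Occupancy", xlab = "Month")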
We fit a linear model to the transformed data using ‘lm’ and examine the results. Note the coefficients. Are they generally significant?
model = lm(Yt_trans ~ t.1+S1+S2+S3+S4+S5+S6+S7+S8+S9+S10+S11, data=data)
summary(model)
##
## Call:
## lm(formula = Yt_trans ~ t.1 + S1 + S2 + S3 + S4 + S5 + S6 + S7 +
## S8 + S9 + S10 + S11, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.055274 -0.012606 0.001915 0.013970 0.046961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2882830 0.0064864 969.462 < 2e-16 ***
## t.1 0.0027254 0.0000341 79.922 < 2e-16 ***
## S1 -0.0421632 0.0080900 -5.212 5.89e-07 ***
## S2 -0.1120314 0.0080885 -13.851 < 2e-16 ***
## S3 -0.0840426 0.0080871 -10.392 < 2e-16 ***
## S4 0.0389463 0.0080859 4.817 3.45e-06 ***
## S5 0.0197923 0.0080848 2.448 0.0155 *
## S6 0.1463526 0.0080839 18.104 < 2e-16 ***
## S7 0.2886271 0.0080831 35.708 < 2e-16 ***
## S8 0.3109017 0.0080824 38.466 < 2e-16 ***
## S9 0.0553191 0.0080819 6.845 1.68e-10 ***
## S10 0.0397366 0.0080816 4.917 2.22e-06 ***
## S11 -0.1122746 0.0080814 -13.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02138 on 155 degrees of freedom
## Multiple R-squared: 0.9884, Adjusted R-squared: 0.9876
## F-statistic: 1105 on 12 and 155 DF, p-value: < 2.2e-16
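Because the response is a natural logarithm, each coefficient can be read as an approximate multiplicative (percentage) effect once exponentiated. A quick sketch of this interpretation:
# Back-transform coefficients to multiplicative effects relative to the base month (December, S12)
exp(coef(model)["S7"])              # July runs at roughly 1.33 times the December level, holding the trend fixed
(exp(coef(model)["t.1"]) - 1) * 100 # approximate trend growth per month, in percent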
Since the transformed Y values are natural logarithms, we back-transform the fitted values with exp() to get predictions on the original scale.
data$predicted = exp(model$fitted.values)
Let’s plot the actual data against the predicted values. They overlap quite well; to see the differences more clearly, we plot the residuals next.
ggplot(data, aes(t)) +
geom_line(aes(y = Yt, colour = "Actual Sales")) +
geom_line(aes(y = predicted, colour = "Predicted Sales"))
res = residuals(model)
plot(jitter(res)~jitter(Yt_trans), ylab="Residuals", xlab="Yt_trans", data=data)
abline(0,0)
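Before moving on, it is also worth checking the residuals for autocorrelation; a minimal sketch using the sample ACF:
# Spikes outside the confidence bands suggest leftover autocorrelation in the residuals
acf(res, lag.max = 24, main = "ACF of regression residuals")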
Let’s now consider a different file where R creates the seasonal dummies internally from a factor.
data = read.csv('C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102/6. Day_6/Data/Hoteldata_subset.csv')
Visualise the data first.
head(data, n=2)
sapply(data, class)
## t Month Yt Yt_trans
## "integer" "factor" "integer" "numeric"
Run the regression using the lm function. Note that the dummy variables are generated automatically from the Month factor, compared with the earlier case where we set them up manually.
model = lm(data$Yt_trans ~ t+Month, data=data)
summary(model)
##
## Call:
## lm(formula = data$Yt_trans ~ t + Month, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0255286 -0.0054610 0.0004604 0.0061223 0.0214604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.748e+00 2.734e-03 1005.165 < 2e-16 ***
## t 1.185e-03 1.465e-05 80.842 < 2e-16 ***
## MonthAug 1.178e-01 3.473e-03 33.907 < 2e-16 ***
## MonthDec -1.733e-02 3.475e-03 -4.989 1.61e-06 ***
## MonthFeb -6.585e-02 3.473e-03 -18.961 < 2e-16 ***
## MonthJan -3.537e-02 3.473e-03 -10.186 < 2e-16 ***
## MonthJul 1.082e-01 3.473e-03 31.165 < 2e-16 ***
## MonthJun 4.649e-02 3.473e-03 13.387 < 2e-16 ***
## MonthMar -5.410e-02 3.473e-03 -15.579 < 2e-16 ***
## MonthMay -8.327e-03 3.473e-03 -2.398 0.0177 *
## MonthNov -6.615e-02 3.474e-03 -19.041 < 2e-16 ***
## MonthOct -1.791e-04 3.474e-03 -0.052 0.9589
## MonthSep 7.148e-03 3.473e-03 2.058 0.0413 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009188 on 155 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9878
## F-statistic: 1131 on 12 and 155 DF, p-value: < 2.2e-16
Observe that the base level is no longer December (as in the previous model) but April (there is no April coefficient above), because R picks the reference level of the factor internally, alphabetically by default.
res <- residuals(model)
plot(jitter(res)~jitter(Yt_trans), ylab="Residuals", xlab="Yt_trans", data=data)
abline(0,0)
We can also run the Durbin-Watson test to check whether the residuals are autocorrelated. In the DW test, the null hypothesis is that there is no correlation among the residuals, i.e. they are independent; the alternative hypothesis is that the residuals are autocorrelated. The ‘golden number’ is a DW statistic of 2: if it is below 2, the residuals are positively autocorrelated, and if it is above 2 they are negatively autocorrelated. What do you think of the results below?
library(lmtest)
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.1899, p-value = 1.703e-07
## alternative hypothesis: true autocorrelation is greater than 0
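The Durbin-Watson test only targets lag-1 autocorrelation. As a complementary check, lmtest also provides the Breusch-Godfrey test for higher-order serial correlation; a sketch using 12 lags for the monthly data:
# Breusch-Godfrey test for serial correlation up to order 12
bgtest(model, order = 12)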
Let’s back-transform the fitted values to get the predictions. Note that in this file Yt_trans is a base-10 logarithm, so we reverse it with 10^ rather than exp().
data$predicted = 10^(model$fitted.values)
ggplot(data, aes(t)) +
geom_line(aes(y = Yt, colour = "Actual")) +
geom_line(aes(y = predicted, colour = "Predicted occupancy"))
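To quantify the fit on the original scale rather than judging it by eye, a simple sketch of in-sample error measures:
# In-sample errors on the original (occupancy) scale
errors = data$Yt - data$predicted
sqrt(mean(errors^2))              # RMSE
mean(abs(errors) / data$Yt) * 100 # MAPE, in percent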
Our next example is on data visualisation using the dygraphs library together with the xts/ts time series objects; xts is an extended form of ts. We use bitcoin prices as an example.
bitcoin = read.csv('C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102/6. Day_6/Data/bitcoin.csv')
Observe the data, then convert the Date factor to a proper Date format.
str(bitcoin)
## 'data.frame': 92 obs. of 2 variables:
## $ Date : Factor w/ 92 levels "2018-04-23 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ClosePrice: num 8938 9652 8864 9279 8978 ...
bitcoin$date <- as.Date(bitcoin$Date, format="%Y-%m-%d")
head(bitcoin, n =4)
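Since lubridate was loaded earlier, the same conversion can also be written with its parsers, which is handy when the timestamps carry a time component (a sketch equivalent to the as.Date call above):
# lubridate alternative: parse the "YYYY-MM-DD HH:MM:SS" strings, then drop the time part
bitcoin$date = as.Date(ymd_hms(as.character(bitcoin$Date)))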
Check that it has been converted to an xts object. An advantage of xts is that you can zoom into the data at a granular level; the time index is handled for you automatically.
library(xts)
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
bitcoin_ts = xts(bitcoin$ClosePrice, order.by = bitcoin$date)
is.xts(bitcoin_ts)
## [1] TRUE
head(bitcoin_ts,5)
## [,1]
## 2018-04-23 8938.30
## 2018-04-24 9652.16
## 2018-04-25 8864.09
## 2018-04-26 9279.00
## 2018-04-27 8978.33
library(dygraphs)
dygraph(bitcoin_ts) %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
Now, play around with some of the standard time series accessor functions.
start(bitcoin_ts)
## [1] "2018-04-23"
end(bitcoin_ts)
## [1] "2018-07-23"
frequency(bitcoin_ts)
## [1] 1
deltat(bitcoin_ts)
## [1] 1
summary(bitcoin_ts)
## Index bitcoin_ts
## Min. :2018-04-23 Min. :5848
## 1st Qu.:2018-05-15 1st Qu.:6576
## Median :2018-06-07 Median :7432
## Mean :2018-06-07 Mean :7541
## 3rd Qu.:2018-06-30 3rd Qu.:8422
## Max. :2018-07-23 Max. :9827
# ts.plot(bitcoin_ts)
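One convenience of xts is easy time-based aggregation; for example, a sketch of rolling the daily closes up to weekly averages:
# Aggregate the daily closing prices to weekly means
btc_weekly = apply.weekly(bitcoin_ts, mean)
head(btc_weekly, 3)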
Let’s plot the autocorrelations of the time series. The correlogram can also hint at whether the series is stationary (recall the lecture).
acf(bitcoin_ts, 12, lwd=1)
Also test for overall randomness with a portmanteau test, which combines the autocorrelations up to the chosen lag. Look at the results: the series is very far from having zero autocorrelation. Note that trending data normally show high autocorrelation. How does the Ljung-Box test compare with the Box-Pierce test?
Box.test(bitcoin_ts, lag=12, type="Ljung")
##
## Box-Ljung test
##
## data: bitcoin_ts
## X-squared = 739.81, df = 12, p-value < 2.2e-16
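For comparison with the question above, the Box-Pierce version of the same portmanteau test only needs a change of the type argument:
# Box-Pierce statistic; slightly smaller than the Ljung-Box value at the same lag
Box.test(bitcoin_ts, lag = 12, type = "Box-Pierce")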
Let’s check a property of the time series that we studied in class: stationarity. We use the augmented Dickey-Fuller test. Look at the results. Is the series stationary?
adfTest(bitcoin_ts)
## Warning in if (class(x) == "timeSeries") x = series(x): the condition has
## length > 1 and only the first element will be used
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 1
## STATISTIC:
## Dickey-Fuller: -1.1644
## P VALUE:
## 0.2419
##
## Description:
## Sun Nov 11 21:15:04 2018 by user: t_car
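Since the level series is clearly non-stationary, a common next step is to difference it and test again; a sketch (coercing to a plain numeric vector first, which also avoids the class warning above):
# First difference of the closing price, then re-run the ADF test
d_btc = diff(as.numeric(bitcoin_ts))
adfTest(d_btc)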
In the workshop folder there is a file “ethereum.csv” which provides the data by the minute. Using the same procedure (with some modifications), repeat the above exercise. Note that the timestamp format is “%Y-%m-%d %H:%M:%S”. A good reference is https://www.stat.berkeley.edu/~s133/dates.html
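As a starting point for that exercise, here is a hedged sketch of the timestamp parsing; the column names Date and ClosePrice are assumed to mirror the bitcoin file, so adjust them (and the path) to the actual header:
# Assumed file location and column names; check names(ethereum) against the actual file
ethereum = read.csv('ethereum.csv')
ethereum$datetime = as.POSIXct(ethereum$Date, format = "%Y-%m-%d %H:%M:%S")
ethereum_ts = xts(ethereum$ClosePrice, order.by = ethereum$datetime)
dygraph(ethereum_ts) %>% dyRangeSelector()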