In this workshop we illustrate how to use R for elementary time series analysis. In particular, we use a hotel occupancy example to illustrate seasonality. We also use R to build interactive time series visualisations with the dygraphs package. We first load the required libraries and set the working directory, then check the data types of all the columns.
library("pacman")
## Warning: package 'pacman' was built under R version 3.4.4
pacman::p_load(tidyverse, lubridate, zoo, forecast, fUnitRoots, dygraphs)
The example is hotel occupancy, which tends to follow a monthly seasonal pattern. Change the working directory to match your own setup.
setwd("C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102/6. Day_6/Data")
data = read.csv('Hoteldata.csv')
sapply(data, class)
## t Yt t.1 S1 S2 S3 S4
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## S5 S6 S7 S8 S9 S10 S11
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## Yt_trans
## "numeric"
First, take a look at the structure of ‘data’ and its initial few entries. Note that Yt_trans is the occupancy Yt transformed with a natural logarithm. The columns S1, S2, … are the monthly seasonal dummies discussed in class.
str(data)
## 'data.frame': 168 obs. of 15 variables:
## $ t : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Yt : int 501 488 504 578 545 632 728 725 585 542 ...
## $ t.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ S1 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ S2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ S3 : int 0 0 1 0 0 0 0 0 0 0 ...
## $ S4 : int 0 0 0 1 0 0 0 0 0 0 ...
## $ S5 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ S6 : int 0 0 0 0 0 1 0 0 0 0 ...
## $ S7 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ S8 : int 0 0 0 0 0 0 0 1 0 0 ...
## $ S9 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ S10 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ S11 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Yt_trans: num 6.22 6.19 6.22 6.36 6.3 6.45 6.59 6.59 6.37 6.3 ...
head(data, 3)
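As an aside, if the dummies were not already provided in the file, here is a minimal sketch of how they could be generated (assuming the 168 monthly observations start in month 1; month_index and dummies are illustrative names):
# Hypothetical illustration: build 12 monthly 0/1 dummies from a repeating month index
month_index = factor(rep(1:12, length.out = nrow(data)))
dummies = model.matrix(~ month_index - 1)
head(dummies, 3)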
Plot the data and observe the time series. Does it appear to have a trend and a seasonal pattern?
plot(data$t, data$Yt, type = "l")
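To see the seasonal pattern more directly, one option is to wrap Yt in a monthly ts object and look at its month-by-month sub-series; a sketch, assuming the data are monthly (the actual start year is not given, so a placeholder is used):
# Assumed monthly frequency; the start year is only a placeholder
Yt_ts = ts(data$Yt, frequency = 12, start = c(1, 1))
monthplot(Yt_ts, ylab = "Occupancy", xlab = "Month")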
We fit a linear model to the transformed data using ‘lm’ and examine the results. Note the coefficients. Are they generally significant?
model = lm(Yt_trans ~ t.1+S1+S2+S3+S4+S5+S6+S7+S8+S9+S10+S11, data=data)
summary(model)
##
## Call:
## lm(formula = Yt_trans ~ t.1 + S1 + S2 + S3 + S4 + S5 + S6 + S7 +
## S8 + S9 + S10 + S11, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.055274 -0.012606 0.001915 0.013970 0.046961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2882830 0.0064864 969.462 < 2e-16 ***
## t.1 0.0027254 0.0000341 79.922 < 2e-16 ***
## S1 -0.0421632 0.0080900 -5.212 5.89e-07 ***
## S2 -0.1120314 0.0080885 -13.851 < 2e-16 ***
## S3 -0.0840426 0.0080871 -10.392 < 2e-16 ***
## S4 0.0389463 0.0080859 4.817 3.45e-06 ***
## S5 0.0197923 0.0080848 2.448 0.0155 *
## S6 0.1463526 0.0080839 18.104 < 2e-16 ***
## S7 0.2886271 0.0080831 35.708 < 2e-16 ***
## S8 0.3109017 0.0080824 38.466 < 2e-16 ***
## S9 0.0553191 0.0080819 6.845 1.68e-10 ***
## S10 0.0397366 0.0080816 4.917 2.22e-06 ***
## S11 -0.1122746 0.0080814 -13.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02138 on 155 degrees of freedom
## Multiple R-squared: 0.9884, Adjusted R-squared: 0.9876
## F-statistic: 1105 on 12 and 155 DF, p-value: < 2.2e-16
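Because the response is a natural logarithm, each coefficient can be read as an approximate multiplicative (percentage) effect once exponentiated. A quick sketch of this interpretation:
# Back-transform coefficients to multiplicative effects relative to the base month (December, S12)
exp(coef(model)["S7"])              # July runs at roughly 1.33 times the December level, holding the trend fixed
(exp(coef(model)["t.1"]) - 1) * 100 # approximate trend growth per month, in percent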
Since the transformed Y values are natural logarithms, we back-transform the fitted values with exp() to get predictions on the original scale.
data$predicted = exp(model$fitted.values)
Let’s plot the actual data against the predicted values. They overlap quite well; to see the differences more clearly, we plot the residuals next.
ggplot(data, aes(t)) +
geom_line(aes(y = Yt, colour = "Actual Sales")) +
geom_line(aes(y = predicted, colour = "Predicted Sales"))
res = residuals(model)
plot(jitter(res)~jitter(Yt_trans), ylab="Residuals", xlab="Yt_trans", data=data)
abline(0,0)
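Before moving on, it is also worth checking the residuals for autocorrelation; a minimal sketch using the sample ACF:
# Spikes outside the confidence bands suggest leftover autocorrelation in the residuals
acf(res, lag.max = 24, main = "ACF of regression residuals")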
Let’s now consider a different file where R creates the seasonal dummies internally from a factor.
data = read.csv('C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102/6. Day_6/Data/Hoteldata_subset.csv')
Visualise the data first.
head(data, n=2)
sapply(data, class)
## t Month Yt Yt_trans
## "integer" "factor" "integer" "numeric"
Run the regression using the lm function. Note that the dummy variables are generated automatically from the Month factor, compared with the earlier case where we set them up manually.
model = lm(data$Yt_trans ~ t+Month, data=data)
summary(model)
##
## Call:
## lm(formula = data$Yt_trans ~ t + Month, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0255286 -0.0054610 0.0004604 0.0061223 0.0214604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.748e+00 2.734e-03 1005.165 < 2e-16 ***
## t 1.185e-03 1.465e-05 80.842 < 2e-16 ***
## MonthAug 1.178e-01 3.473e-03 33.907 < 2e-16 ***
## MonthDec -1.733e-02 3.475e-03 -4.989 1.61e-06 ***
## MonthFeb -6.585e-02 3.473e-03 -18.961 < 2e-16 ***
## MonthJan -3.537e-02 3.473e-03 -10.186 < 2e-16 ***
## MonthJul 1.082e-01 3.473e-03 31.165 < 2e-16 ***
## MonthJun 4.649e-02 3.473e-03 13.387 < 2e-16 ***
## MonthMar -5.410e-02 3.473e-03 -15.579 < 2e-16 ***
## MonthMay -8.327e-03 3.473e-03 -2.398 0.0177 *
## MonthNov -6.615e-02 3.474e-03 -19.041 < 2e-16 ***
## MonthOct -1.791e-04 3.474e-03 -0.052 0.9589
## MonthSep 7.148e-03 3.473e-03 2.058 0.0413 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009188 on 155 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9878
## F-statistic: 1131 on 12 and 155 DF, p-value: < 2.2e-16
Observe that the base level is no longer December (as in the previous model) but April (there is no April coefficient above), because R picks the reference level of the factor internally, alphabetically by default.
res <- residuals(model)
plot(jitter(res)~jitter(Yt_trans), ylab="Residuals", xlab="Yt_trans", data=data)
abline(0,0)
We can also run the Durbin-Watson test to check whether the residuals are autocorrelated. In the DW test, the null hypothesis is that there is no correlation among the residuals, i.e. they are independent; the alternative hypothesis is that the residuals are autocorrelated. The ‘golden number’ is a DW statistic of 2: if it is below 2, the residuals are positively autocorrelated, and if it is above 2 they are negatively autocorrelated. What do you think of the results below?
library(lmtest)
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.1899, p-value = 1.703e-07
## alternative hypothesis: true autocorrelation is greater than 0
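The Durbin-Watson test only targets lag-1 autocorrelation. As a complementary check, lmtest also provides the Breusch-Godfrey test for higher-order serial correlation; a sketch using 12 lags for the monthly data:
# Breusch-Godfrey test for serial correlation up to order 12
bgtest(model, order = 12)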
Let’s back-transform the fitted values to get the predictions. Note that in this file Yt_trans is a base-10 logarithm, so we reverse it with 10^ rather than exp().
data$predicted = 10^(model$fitted.values)
ggplot(data, aes(t)) +
geom_line(aes(y = Yt, colour = "Actual")) +
geom_line(aes(y = predicted, colour = "Predicted occupancy"))
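To quantify the fit on the original scale rather than judging it by eye, a simple sketch of in-sample error measures:
# In-sample errors on the original (occupancy) scale
errors = data$Yt - data$predicted
sqrt(mean(errors^2))              # RMSE
mean(abs(errors) / data$Yt) * 100 # MAPE, in percent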
Our next example is on data visualisation using the dygraphs library together with the xts/ts time series objects; xts is an extended form of ts. We use bitcoin prices as an example.
bitcoin = read.csv('C:/Users/t_car/OneDrive/Documents/MTech/5. EB5102/6. Day_6/Data/bitcoin.csv')
Observe the data, then convert the Date factor to a proper Date format.
str(bitcoin)
## 'data.frame': 92 obs. of 2 variables:
## $ Date : Factor w/ 92 levels "2018-04-23 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ClosePrice: num 8938 9652 8864 9279 8978 ...
bitcoin$date <- as.Date(bitcoin$Date, format="%Y-%m-%d")
head(bitcoin, n =4)
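Since lubridate was loaded earlier, the same conversion can also be written with its parsers, which is handy when the timestamps carry a time component (a sketch equivalent to the as.Date call above):
# lubridate alternative: parse the "YYYY-MM-DD HH:MM:SS" strings, then drop the time part
bitcoin$date = as.Date(ymd_hms(as.character(bitcoin$Date)))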
Check that it has been converted to an xts object. An advantage of xts is that you can zoom into the data at a granular level; the time index is handled for you automatically.
library(xts)
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
bitcoin_ts = xts(bitcoin$ClosePrice, order.by = bitcoin$date)
is.xts(bitcoin_ts)
## [1] TRUE
head(bitcoin_ts,5)
## [,1]
## 2018-04-23 8938.30
## 2018-04-24 9652.16
## 2018-04-25 8864.09
## 2018-04-26 9279.00
## 2018-04-27 8978.33
library(dygraphs)
dygraph(bitcoin_ts) %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
Now, play around with some of the standard time series accessor functions.
start(bitcoin_ts)
## [1] "2018-04-23"
end(bitcoin_ts)
## [1] "2018-07-23"
frequency(bitcoin_ts)
## [1] 1
deltat(bitcoin_ts)
## [1] 1
summary(bitcoin_ts)
## Index bitcoin_ts
## Min. :2018-04-23 Min. :5848
## 1st Qu.:2018-05-15 1st Qu.:6576
## Median :2018-06-07 Median :7432
## Mean :2018-06-07 Mean :7541
## 3rd Qu.:2018-06-30 3rd Qu.:8422
## Max. :2018-07-23 Max. :9827
# ts.plot(bitcoin_ts)
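One convenience of xts is easy time-based aggregation; for example, a sketch of rolling the daily closes up to weekly averages:
# Aggregate the daily closing prices to weekly means
btc_weekly = apply.weekly(bitcoin_ts, mean)
head(btc_weekly, 3)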
Let’s plot the autocorrelations of the time series. The correlogram can also hint at whether the series is stationary (recall the lecture).
acf(bitcoin_ts, 12, lwd=1)
Also test for overall randomness with a portmanteau test, which combines the autocorrelations up to the chosen lag. Look at the results: the series is very far from having zero autocorrelation. Note that trending data normally show high autocorrelation. How does the Ljung-Box test compare with the Box-Pierce test?
Box.test(bitcoin_ts, lag=12, type="Ljung")
##
## Box-Ljung test
##
## data: bitcoin_ts
## X-squared = 739.81, df = 12, p-value < 2.2e-16
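For comparison with the question above, the Box-Pierce version of the same portmanteau test only needs a change of the type argument:
# Box-Pierce statistic; slightly smaller than the Ljung-Box value at the same lag
Box.test(bitcoin_ts, lag = 12, type = "Box-Pierce")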
Let’s check a property of the time series that we studied in class: stationarity. We use the augmented Dickey-Fuller test. Look at the results. Is the series stationary?
adfTest(bitcoin_ts)
## Warning in if (class(x) == "timeSeries") x = series(x): the condition has
## length > 1 and only the first element will be used
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 1
## STATISTIC:
## Dickey-Fuller: -1.1644
## P VALUE:
## 0.2419
##
## Description:
## Sun Nov 11 21:15:04 2018 by user: t_car
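Since the level series is clearly non-stationary, a common next step is to difference it and test again; a sketch (coercing to a plain numeric vector first, which also avoids the class warning above):
# First difference of the closing price, then re-run the ADF test
d_btc = diff(as.numeric(bitcoin_ts))
adfTest(d_btc)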
In the workshop folder there is a file “ethereum.csv” which provides the data by the minute. Using the same procedure (with some modifications), repeat the above exercise. Note that the timestamp format is “%Y-%m-%d %H:%M:%S”. A good reference is https://www.stat.berkeley.edu/~s133/dates.html
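As a starting point for that exercise, here is a hedged sketch of the timestamp parsing; the column names Date and ClosePrice are assumed to mirror the bitcoin file, so adjust them (and the path) to the actual header:
# Assumed file location and column names; check names(ethereum) against the actual file
ethereum = read.csv('ethereum.csv')
ethereum$datetime = as.POSIXct(ethereum$Date, format = "%Y-%m-%d %H:%M:%S")
ethereum_ts = xts(ethereum$ClosePrice, order.by = ethereum$datetime)
dygraph(ethereum_ts) %>% dyRangeSelector()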