When a variable is measured sequentially in time at a fixed interval, known as the sampling interval, the resulting data form a time series. Observations that have been collected over fixed sampling intervals form a historical time series.
A sequence of random variables defined at fixed sampling intervals is sometimes referred to as a discrete-time stochastic process, though the shorter name time series model is often preferred.
The theory of stochastic processes is vast and may be studied without necessarily fitting any models to data.
The main features of many time series are trends and seasonal variations that can be modelled deterministically with mathematical functions of time. Another important feature of most time series is that observations close together in time tend to be correlated (serially dependent). Much of the methodology in a time series analysis is aimed at explaining this correlation and the main features in the data using appropriate statistical models and descriptive methods. Once a good model is found and fitted to data, the analyst can use the model to forecast future values or generate simulations to guide planning decisions. Fitted models are also used as a basis for statistical tests.

Sampling intervals differ in their relation to the data. The data may have been aggregated or sampled. If data are sampled, the sampling interval must be short enough for the time series to provide a very close approximation to the original continuous signal when it is interpolated.
The data, which give the number of international passenger bookings (in thousands) per month, are available as a time series in R and illustrate several important concepts that arise in an exploratory time series analysis.
data("AirPassengers")
AP <- AirPassengers
AP
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
All data in R are stored in objects, which have a range of methods available. The class of an object can be found using the class function:
In this case, the object is of class ts, which is an abbreviation for 'time series'. Time series objects have a number of methods available, including the functions start, end, and frequency shown below.
class(AP)
## [1] "ts"
start(AP); end(AP); frequency(AP)
## [1] 1949 1
## [1] 1960 12
## [1] 12
One of the most important steps in a preliminary time series analysis is to plot the data, i.e., create a time plot. For a time series object, this is achieved with the generic plot function. (In a later example, a ts object will also be created from data read in from an external source.)
plot(AP, ylab="Passengers(1000's)")
A time series plot not only emphasises patterns and features of the data but can also expose outliers and erroneous values. One cause of the latter is that missing data are sometimes coded using a negative value. Such values need to be handled differently in the analysis and must not be included as observations when fitting a model to the data. Outlying values that cannot be attributed to some coding error should be checked carefully. If they are correct, they are likely to be of particular interest and should not be excluded from the analysis. However, it may be appropriate to consider robust methods of fitting models, which reduce the influence of outliers.
To get a clearer view of the trend, the seasonal effect can be removed by aggregating the data to the annual level, which can be achieved in R using the aggregate function. A summary of the values for each season can be viewed using a boxplot, with the cycle function being used to extract the seasons for each item of data. The plots can be put in a single graphics window using the layout function, which takes as input a vector (or matrix) giving the location of each plot in the display window.
layout(1:2)
plot(aggregate(AP))
boxplot(AP ~ cycle(AP))
Unemployment rates are one of the main economic indicators used by politicians and other decision makers. The monthly unemployment rate for the US state of Maine from January 1996 until August 2006 is plotted below.
In any time series analysis, it is essential to understand how the data have been collected and their unit of measurement. The US Department of Labor gives precise definitions of the terms used to calculate the unemployment rate.
The monthly unemployment data are available in a file that is read into R in the code below. Note that the first row in the file contains the name of the variable (unemploy), which can be accessed directly once the attach command is given. Also, the header parameter must be set to TRUE so that R treats the first row as the variable name rather than as data.
www <- "Maine.dat"
Maine.month <- read.table(www, header = TRUE)
attach(Maine.month)
class(Maine.month)
## [1] "data.frame"
When we read data in this way from an ASCII text file, the 'class' is not time series but data.frame. The following command uses the ts function to convert the data to a time series object:
Maine.month.ts <- ts(unemploy, start = c(1996, 1), freq=12)
This uses all the data. You can select a smaller number of observations by specifying an earlier end date using the parameter end. If we wish to analyse trends in the unemployment rate, annual data will suffice. The average (mean) over the twelve months of each year is another example of aggregated data, but this time we divide by 12 to give a mean annual rate.
Maine.annual.ts <- aggregate(Maine.month.ts)/12
We now plot both time series. There is clear monthly variation: it seems that the February figure is typically about 20% higher than the annual average, whereas the August figure tends to be roughly 20% lower.
layout(1:2)
plot(Maine.month.ts, ylab= "unemployed (%)")
plot(Maine.annual.ts, ylab = "unemployed(%)")
We can calculate the precise percentages in R using window. This function extracts the part of the time series between specified start and end points, and will sample with an interval equal to frequency if its frequency argument is set to TRUE. So, the first line below gives a time series of the February figures.
Maine.Feb <- window(Maine.month.ts, start = c(1996, 2), freq = TRUE)
Maine.Aug <- window(Maine.month.ts, start = c(1996, 8), freq = TRUE)
Feb.ratio <- mean(Maine.Feb) / mean(Maine.month.ts)
Aug.ratio <- mean(Maine.Aug) / mean(Maine.month.ts)
Feb.ratio
## [1] 1.222529
Aug.ratio
## [1] 0.8163732
On average, unemployment is 22% higher in February and 18% lower in August. An explanation is that Maine attracts tourists during the summer, and this creates more jobs. Also, the period before Christmas and over the New Year's holiday tends to have higher employment than the first few months of the new year. The annual unemployment rate was as high as 8.5% in 1976 but was less than 4% in 1988 and again during the three years 1999-2001. If we had sampled the data in August of each year, for example, rather than taken yearly averages, we would have consistently underestimated the unemployment rate by a factor of about 0.8.
The chocolate, beer, and electricity production data are available in the file cbe.dat, read below.
www1 <- "cbe.dat"
CBE <- read.table(www1, header = TRUE)
CBE[1:4,]
## choc beer elec
## 1 1451 96.3 1497
## 2 2037 84.4 1463
## 3 2477 91.2 1648
## 4 2785 81.9 1595
class(CBE)
## [1] "data.frame"
We now create time series objects for the electricity, beer, and chocolate data. If end is omitted, R uses the full length of the vector, and if the month is omitted in start, R assumes 1. We use plot with cbind to plot several series in one figure.
elec.ts <- ts(CBE[,3], start = 1958, freq = 12)
beer.ts <- ts(CBE[,2], start = 1958, freq = 12)
choc.ts <- ts(CBE[,1], start = 1958, freq = 12)
plot(cbind(elec.ts, beer.ts, choc.ts))
The plots in the figure above show increasing trends in production for all three goods, partly due to the rising population in Australia from about 10 million to about 18 million over the same period. Yet electricity production has risen by a factor of 7, and chocolate production by a factor of 4, over this period during which the population has not quite doubled.
The three series constitute a multiple time series. There are many functions in R for handling more than one series, including ts.intersect, which obtains the intersection of two series that overlap in time. We now illustrate the use of ts.intersect and point out some potential pitfalls in analysing multiple time series. The intersection between the air passenger data and the electricity data is obtained as follows:
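As a rough check of these growth factors (an added illustration, not part of the original text), the mean production in the final year of each series can be compared with the mean in the first year; the end year is taken from the series itself rather than assumed:
# Rough check of growth: mean of the final year over mean of the first year (1958)
final.year <- end(elec.ts)[1]
mean(window(elec.ts, start = c(final.year, 1))) / mean(window(elec.ts, end = c(1958, 12)))
mean(window(choc.ts, start = c(final.year, 1))) / mean(window(choc.ts, end = c(1958, 12)))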
ap.elec <- ts.intersect(AP, elec.ts)
Now check that the output agrees with ours, as shown below.
start(ap.elec)
## [1] 1958 1
end(ap.elec)
## [1] 1960 12
ap.elec[1:3,]
## AP elec.ts
## [1,] 340 1497
## [2,] 318 1463
## [3,] 362 1648
In the code below, the data for each series are extracted and plotted.
ap <- ap.elec[,1]; elec <- ap.elec[,2]
layout(1:2)
plot(ap, main= "", ylab ="Air passenger / 1000's")
plot(elec, main="", ylab = "Electricity production / MkWh")
plot(as.vector(ap), as.vector(elec),
xlab="Air passengers / 1000's",
ylab = "Electricity production / MWh")
abline(reg = lm(elec ~ ap))
cor(ap, elec)
## [1] 0.8841668
In the plot function above, as.vector is needed to convert the ts objects to ordinary vectors suitable for a scatter plot.
The two time series are highly correlated, as can be seen in the plots, with a correlation coefficient of 0.88. It can be observed that the two time plots look similar and that the scatter plot shows an approximately linear association between the two variables. Nevertheless, it is important to realise that correlation does not imply causation. In this case, it is not plausible that higher numbers of air passengers in the United States cause, or are caused by, higher electricity production in Australia. A more reasonable explanation for the correlation is that the increasing prosperity and technological development in both countries over this period account for the increasing trends.
The two time series also happen to have similar seasonal variations. For these reasons, it is usually appropriate to remove trends and seasonal effects before comparing multiple series. This is often achieved by working with the residuals of a regression model that has deterministic terms to represent the trend and seasonal effects.
In the simplest cases, the residuals can be modelled as independent random variation from a single distribution.
In the next example we work with financial data (exchange rates). In such series, marked patterns of this kind are less likely to be seen, and different methods of analysis are usually required. Day-to-day changes are more difficult to explain because the underlying causes are complex and impossible to isolate, and it will often be unrealistic to assume any deterministic component in the time series model.
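A minimal sketch of that approach (an added illustration, not part of the original analysis) regresses each series on time and a monthly factor, and then correlates the two residual series:
# Remove deterministic trend and seasonal effects by regression, then correlate residuals
ap.lm   <- lm(ap ~ time(ap) + factor(cycle(ap)))
elec.lm <- lm(elec ~ time(elec) + factor(cycle(elec)))
cor(resid(ap.lm), resid(elec.lm))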
The exchange rates for British pounds sterling to New Zealand dollars for the period January 1991 to March 2000 are shown below. The data are mean values taken over quarterly periods of three months, with the first quarter being January to March and the last quarter being October to December.
www2 <- "pounds_nz.dat"
z <- read.table(www2, header = TRUE)
z[1:4,]
## [1] 2.9243 2.9422 3.1719 3.2542
z.ts <- ts(z, st =1991, fr = 4)
plot(z.ts, xlab = "time / years",
ylab = "Quartely exchange rate in $NZ / pound")
Short-term trends are apparent in the time series: after an initial surge ending in 1992, a negative trend leads to a minimum around 1996, which is followed by a positive trend in the second half of the series. See the plot above.
The trend seems to change direction at unpredictable times rather than displaying the relatively consistent patterns of the air passenger series and the Australian production series. Such trends have been termed stochastic trends to emphasise this randomness and to distinguish them from more deterministic trends like those seen in the previous examples. A mathematical model known as a random walk can sometimes provide a good fit to data like these and is fitted to this series in the plots below. Stochastic trends are common in financial series.
Two local trends are emphasised when the series is partitioned into two subseries based on the periods 1992-1996 and 1996-1998. The window function can be used to extract the subseries:
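A minimal simulation (added here as an illustration, not taken from the text) shows the meandering behaviour a random walk produces; each value is the previous value plus an independent white-noise step:
# Simulate a random walk: x[t] = x[t-1] + w[t], with w[t] Gaussian white noise
set.seed(1)
rw <- ts(cumsum(rnorm(200)))
plot(rw, ylab = "Simulated random walk")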
z.92.96 <- window(z.ts, start = c(1992, 1), end = c(1996, 1))
z.96.98 <- window(z.ts, start = c(1996, 1), end = c(1998, 1))
layout(1:2)
plot(z.92.96, ylab = "Exchange rate in $NZ/pound",
xlab = "Time (years)")
plot(z.96.98, ylab = "Exchange rate in $NZ/pound",
xlab = "Time(years)")
The plots above show that after 1996 the data started to follow an increasing trend. Without additional information, it would be inadvisable to extrapolate either of the local trends shown in the plots just made. This illustrates the potential pitfall of inappropriate extrapolation of stochastic trends when the underlying causes are not properly understood. To reduce the risk of making an inappropriate forecast, statistical tests can be used to test for a stochastic trend.
In climate change studies, the following global temperature series, expressed as anomalies from the monthly means over the period 1961-1990, plays a central role.
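One such test is the augmented Dickey-Fuller test for a unit root. A minimal sketch is given below; it assumes the tseries package is installed, which is not used elsewhere in this text:
# Augmented Dickey-Fuller test for a unit root (evidence of a stochastic trend)
library(tseries)
adf.test(as.vector(z.ts))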
www3 <- "global.dat"
Global <- scan(www3)
global.ts <- ts(Global, st = c(1856, 1), end = c(2005, 2), fr = 12)
global.annual <- aggregate(global.ts, FUN = mean)
plot(global.ts)
plot(global.annual)
It is the trend that is of most concern, so the aggregate function is used to remove any seasonal effects within each year and produce an annual series of mean temperatures for the period 1856 to 2005, shown in the plots above. We can avoid explicitly dividing by 12 if we specify FUN = mean in the aggregate function.
The upward trend from about 1970 onwards has been used as evidence of global warming. In the code below, the monthly time intervals corresponding to the 36-year period 1970-2005 are extracted using the time function, and the associated observed temperature series is extracted using window. The data are plotted and a line superimposed using a regression of temperature on the new time index.
new.series <- window(global.ts, start=c(1970, 1), end = c(2005, 12))
## Warning in window.default(x, ...): 'end' value not changed
new.time <- time(new.series)
plot(new.series); abline(reg=lm(new.series~new.time))
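As a small addition (not in the original text), the slope of the fitted line, which estimates the rate of change of the temperature anomaly in degrees per year, can be extracted directly:
# Estimated warming rate (degrees per year) from the fitted regression line
coef(lm(new.series ~ new.time))[2]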
In R, the function decompose estimates trends and seasonal effects using a moving average method. Nesting the function within plot (e.g., using plot(decompose()) or plot(stl())) produces a single figure showing the original series xt and the decomposed series: the trend mt, the seasonal effect st, and the remaining random component zt. For example, with the electricity data, additive and multiplicative decomposition plots are given by the commands below; the last plot, which uses lty to give different line types, is the superposition of the seasonal effect on the trend.
plot(decompose(elec.ts))
elec.decom <- decompose(elec.ts, type = "mult")
plot(elec.decom)
trend <- elec.decom$trend
seasonal <- elec.decom$seasonal
ts.plot(cbind(trend, trend*seasonal), lty = 1:2)
In this example, the multiplicative model would seem more appropriate than the additive model because the variance of the original series and of the trend increases with time. However, the random component, which corresponds to zt, also has an increasing variance, which indicates that a log-transformation may be more appropriate for this series. The random series obtained from the decompose function is not precisely a realisation of the random process zt but rather an estimate of that realisation. It is an estimate because it is obtained from the original time series using estimates of the trend and seasonal effects. This estimate of the realisation of the random process is the residual error series.
Once we have identified any trend and seasonal effects, we can deseasonalise the time series and remove the trend. If we use the additive decomposition method, we first calculate the seasonally adjusted time series and then remove the trend by subtraction. This leaves the random component, but the random component is not necessarily well modelled by independent random variables. In many cases, consecutive values will be correlated. If we identify such correlations, we can improve our forecasts, quite dramatically so if the correlations are high. We also need to estimate correlations if we are to generate realistic time series for simulations. The correlation structure of a time series model is defined by the correlation function, and we estimate this from the observed time series.
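As a brief illustration of that suggestion (added here, not taken from the text), the logged series can be decomposed additively, which corresponds to a multiplicative structure on the original scale:
# Additive decomposition of the log-transformed electricity series
plot(decompose(log(elec.ts)))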
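A minimal sketch of the deseasonalising and detrending step described above, using an additive decomposition of the electricity series (an added illustration rather than part of the original analysis):
# Seasonally adjust, then remove the trend by subtraction (additive decomposition)
elec.add <- decompose(elec.ts)                  # additive model by default
elec.seasadj <- elec.ts - elec.add$seasonal     # seasonally adjusted series
elec.rand <- elec.seasadj - elec.add$trend      # estimate of the random component
plot(elec.rand)                                 # equals elec.add$random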
www4 <- "Herald.dat"
herald.dat <- read.table(www4, header = T)
attach(herald.dat)
We now use R to calculate the covariance for the Herald Square pairs in three different ways:
x <- CO; y <- Benzoa; n<- length(x)
sum((x-mean(x))*(y-mean(y)))/(n-1)
## [1] 5.511042
mean((x-mean(x))*(y-mean(y)))
## [1] 5.166602
cov(x,y)
## [1] 5.511042
The correspondence between the R code above and the expectation definition of covariance should be noted:
mean((x - mean(x)) * (y - mean(y)))  corresponds to  E[(x - μx)(y - μy)]
Correlation is a dimensionless measure of the linear association between a pair of variables (x, y) and is obtained by standardising the covariance, dividing it by the product of the standard deviations of the variables. Correlation takes a value between -1 and +1, with a value of -1 or +1 indicating an exact linear association, the (x, y) pairs falling on a straight line of negative or positive slope, respectively. The correlation between the CO and benzoapyrene measurements at Herald Square is now calculated both from the definition and using cor.
cov(x, y) / (sd(x) * sd(y))
## [1] 0.3550973
cor(x,y)
## [1] 0.3550973
Although the correlation is small, there is nevertheless a physical explanation for it, because both pollutants are products of incomplete combustion. A correlation of 0.36 typically corresponds to a slight visual impression that y tends to increase as x increases, although the points will be well scattered.
The sample autocorrelation function (acf) at lag k is defined as rk = ck / c0, where ck is the sample autocovariance at lag k and c0 is the sample variance (the autocovariance at lag 0).
We will demonstrate the calculations in R using a time series of wave heights (mm relative to still water level) measured at the centre of a wave tank. There is no trend and no seasonal period, so it is reasonable to suppose the time series is a realisation of a stationary process.
www5 <- "wave.dat"
wave.dat <- read.table(www5, header = T); attach(wave.dat)
plot(ts(wave.dat))
plot(ts(waveht[1:60]))
The upper plot shows the entire time series. There are no outlying values. The lower plot is of the first sixty wave heights. We can see that there is a tendency for consecutive values to be relatively similar and that the form is like a rough sea, with a quasi-periodicity but no fixed frequency.
The autocorrelations of x are stored in the vector acf(x)$acf, with the lag k autocorrelation located in acf(x)$acf[k+1]. For example, the lag 1 autocorrelation for waveht is
acf(waveht)$acf[2]
## [1] 0.4702564
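As a quick check of the definition rk = ck / c0 (an added illustration, not in the original), the lag 1 value can be computed directly; note that R's acf uses a divisor of n for the autocovariances:
# Direct calculation of the lag 1 sample autocorrelation r1 = c1 / c0
w <- waveht; n <- length(w); wbar <- mean(w)
c1 <- sum((w[1:(n - 1)] - wbar) * (w[2:n] - wbar)) / n
c0 <- sum((w - wbar)^2) / n
c1 / c0    # should agree with acf(waveht)$acf[2]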
Note that the first entry, acf(waveht)$acf[1], is r0 and equals 1. A scatter plot, such as the figure below for the Herald Square data, complements the calculation of the correlation and alerts us to any non-linear patterns. In a similar way, we can draw a scatter plot corresponding to each autocorrelation. For example, for lag 1 we plot waveht[1:396] against waveht[2:397] to obtain the figure below.
plot(waveht[1:396], waveht[2:397])
Autocovariances are obtained by adding an argument to acf. The lag 1 autocovariance is given by
acf(waveht, type = c("covariance"))$acf[2]
## [1] 33328.39
Although we want to know about trends and seasonal patterns in a time series, we do not necessarily rely on the correlogram to identify them. The main use of the correlogram is to detect autocorrelations in the time series after we have removed an estimate of the trend and seasonal variation. In the code below, the air passenger series is seasonally adjusted and the trend removed using decompose.
data("AirPassengers")
ap <- AirPassengers
ap.decomp <- decompose(ap, "multiplicative")
plot(ts(ap.decomp$random[7:138]))
acf(ap.decomp$random[7:138])
To plot the random component and draw the correlogram, we need to remember that a consequence of using a centred moving average of 12 months to smooth the time series, and thereby estimate the trend, is that the first six and last six terms in the random component cannot be calculated and are thus stored in R as NA. The random component and correlogram are shown below.
The correlogram in the last figure suggests either a damped cosine shape, characteristic of an autoregressive model of order 2, or that the seasonal adjustment has not been entirely effective. The latter explanation is unlikely because the decomposition does estimate twelve independent monthly indices. If we investigate further, we see that the standard deviation of the original series from July until June is 109, the standard deviation of the series after subtracting the trend estimate is 41, and the standard deviation after seasonal adjustment is just 0.03.
sd(ap[7:138])
## [1] 109.4187
sd(ap[7:138] - ap.decomp$trend[7:138])
## [1] 41.11491
sd(ap.decomp$random[7:138])
## [1] 0.0333884
The reduction in the standard deviation shows that the seasonal adjustment has been very effective.
Monthly effective inflows (m³ s⁻¹) to the Font Reservoir in Northumberland for the period from January 1909 until December 1980 have been provided by Northumbrian Water PLC. It can be observed that there was a slight decreasing trend over this period and substantial seasonal variation. The trend and seasonal variation have been estimated by regression, and the residual series (adflow) is analysed here. The main difference between the regression approach and using decompose is that the former assumes a linear trend, whereas the latter smooths the time series without assuming any particular form for the trend.
www6 <- "Fontdsdt.dat"
fontdsdt.dat <- read.table(www6, header = T)
attach(fontdsdt.dat)
plot(ts(adflow), ylab = 'adflow')
acf(adflow, xlab = 'lag (months)', main="")
There is a statistically significant correlation at lag 1. The physical interpretation is that the inflow next month is more likely than not to be above average if the inflow this month is above average. Similarly, if the inflow this month is below average, it is more likely than not that next month's inflow will be below average. The explanation is that the groundwater supply can be thought of as a slowly discharging reservoir. If groundwater is high one month, it will augment inflows and is likely to do so next month as well. Given this explanation, you may be surprised that the lag 1 correlation is not higher. The reason is that most of the inflow is runoff following rainfall, and in Northumberland there is little correlation between seasonally adjusted rainfall in consecutive months.

An exponential decay in the correlogram is typical of a first-order autoregressive model, and the correlogram of the adjusted inflows is consistent with an exponential decay. However, given the sampling errors for a time series of this length, estimates of the autocorrelations at higher lags are unlikely to be statistically significant. This is not a practical limitation because such low correlations are inconsequential. When we come to identify suitable models, we should remember that there is no one correct model and that there will often be a choice of suitable models. We may make use of a specific statistical criterion, such as Akaike's information criterion (see below), to choose a model, but this does not imply that the model is correct.
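As a small follow-up sketch (not part of the original analysis), a first-order autoregressive model can be fitted to the adjusted inflows and its coefficient compared with the lag 1 autocorrelation:
# Fit an AR(1) model to the adjusted inflows (Yule-Walker estimation by default)
adflow.ar1 <- ar(adflow, order.max = 1, aic = FALSE)
adflow.ar1$ar    # estimated coefficient; compare with acf(adflow)$acf[2]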