November 17, 2016

Time Series data

With cross-sectional data we were (hopefully) pulling a random sample from a larger population.

In that case, we got some random noise (the error term) in each observation, but it was reasonable to assume the errors would cancel out.

Now we're looking at data generated by a process that unfolds over time. We are no longer drawing independent random samples from a population.

Example: The Cobweb model

If there's a time lag between decisions made by different parts of the market (e.g. how much soy farmers grow and how much consumers buy), there may be persistent errors that are related to the errors from the previous period (growing season).

Example: The Cobweb model

The first period, sellers perceive a high price so they plant a lot.

Then they see a low price and plant little… which results in a higher price.

In this case, price at time \(t\) is negatively correlated with price at time \(t-1\). The data will oscillate in a (somewhat) consistent fashion.

Example: GDP

This year's GDP is basically last year's GDP, plus extra productivity from new workers and innovation, minus some productivity from workers leaving the market.

GDP at time \(t\) is positively correlated with GDP at \(t-1\).

Autocorrelation (sometimes called "serial correlation")

  • Negative autocorrelation: successive error terms tend to have opposite signs, so the errors jump up and down from one period to the next (as in the cobweb model).
  • Positive autocorrelation: successive error terms tend to follow one another; instead of jumping up and down, they sway gradually up and down (as with GDP). A simulation of both patterns follows below.
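
A quick way to see the two patterns is to simulate them. This is a minimal sketch (the function sim.ar1, the sample size, and the values \(\rho = \pm 0.9\) are illustrative choices, not anything from the examples above):

# Simulate AR(1) errors u_t = rho * u_{t-1} + e_t for positive and negative rho
set.seed(1)
eps <- rnorm(100)
sim.ar1 <- function(rho, eps) {
  u <- numeric(length(eps))
  for (t in 2:length(eps)) u[t] <- rho * u[t - 1] + eps[t]
  u
}
par(mfrow = c(2, 1))
plot(sim.ar1(0.9, eps),  type = "l", ylab = "u", main = "Positive autocorrelation (rho = 0.9)")
plot(sim.ar1(-0.9, eps), type = "l", ylab = "u", main = "Negative autocorrelation (rho = -0.9)")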

Problems of autocorrelation

With autocorrelated errors, the OLS coefficient estimates are still unbiased, but the usual standard-error formulas are no longer valid (with positive autocorrelation they tend to understate the true uncertainty), so t-tests and confidence intervals can be misleading.

Modelling autocorrelation

In the case of the cobweb model, we're looking at an "AR(1)" model. We can think of the error term at time \(t\) as follows:

\[u_t = \rho u_{t-1} + \varepsilon_t\]

Where \(\varepsilon_t\) is the random component.

An "AR(n)" model is one where the error term at time \(t\) is correlated with the error from the previous \(n\) periods.

Testing for autocorrelation

library(pwt8)      # Penn World Table 8.1
library(dplyr)     # for %>%, filter(), select()
library(ggplot2)
gdp.us <- pwt8.1 %>% filter(country == "United States of America") %>% select(rgdpo, year)
ggplot(gdp.us, aes(x = year, y = rgdpo)) + geom_line() + geom_smooth(method = "lm", se = FALSE)

A simple option is to look at a graph of the data over time. Comparing the data to a linear time trend, we can see that real GDP is sometimes higher than the trend predicts and sometimes lower.

Testing for autocorrelation

Looking at the residuals, this autocorrelation becomes even clearer.
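
The trend model fit1 used in the next chunk isn't defined above; presumably it is the linear time trend from the previous plot. A minimal sketch, assuming that definition:

# Assumed definition of fit1: the linear time-trend regression from the plot above
fit1 <- lm(rgdpo ~ year, data = gdp.us)
# Plot the residuals over time; long runs above and below zero suggest positive autocorrelation
ggplot(data.frame(year = gdp.us$year, res = residuals(fit1)), aes(x = year, y = res)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed")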

Testing for autocorrelation

To see if this is an AR(1) process, we can regress each residual on the previous period's residual; a large, statistically significant coefficient on the lagged residual is evidence of AR(1) autocorrelation.

df.res <- cbind(fit1$model, res = fit1$residuals)   # data used in fit1, plus its residuals
fit.ar1 <- lm(res ~ lag(res), df.res)               # regress each residual on the previous one (dplyr::lag)
summary(fit.ar1)
Call:
lm(formula = res ~ lag(res), data = df.res)

Residuals:
    Min      1Q  Median      3Q     Max 
-614285  -87928    4546   92019  323174 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.234e+04  2.168e+04  -0.569    0.571    
lag(res)     9.443e-01  3.204e-02  29.469   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 169300 on 59 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.9364,    Adjusted R-squared:  0.9353 
F-statistic: 868.4 on 1 and 59 DF,  p-value: < 2.2e-16

The coefficient on the lagged residual is about 0.94 and highly significant: each residual is close to the one before it, which is strong evidence of positive (AR(1)) autocorrelation.

The basic components of time series data

Remember, our goal is to understand variation in Y by looking at variation in X. Normally, if we see X and Y increasing together, we take that as evidence that X and Y are highly correlated. But now that comovement might simply be due to a shared time trend.

Removing this regularity in the data means we can ask about variation in X and Y around the trend.
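
As a sketch of what removing a trend looks like in practice (the series x, y, and period below are made up for illustration; they are not from the GDP example):

# Two unrelated series that both trend upward
set.seed(2)
period <- 1:80
x <- 0.5 * period + rnorm(80)
y <- 0.3 * period + rnorm(80)
cor(x, y)                                # high, driven almost entirely by the shared trend
# Detrend each series by regressing it on time and keeping the residuals
x.detrended <- residuals(lm(x ~ period))
y.detrended <- residuals(lm(y ~ period))
cor(x.detrended, y.detrended)            # close to zero once the trend is removed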

Another regularity is seasonal variation.

Example

library(Quandl)
# New York State unemployment series from the BLS, via Quandl
ny.ur <- Quandl("BLSE/LAUST360000000000004")
ny.ur <- ny.ur[order(ny.ur$Date), ]      # make sure observations run oldest to newest
# as.ts() ignores start/end/frequency for a plain vector, so build the monthly series with ts()
UR <- ts(ny.ur$Value, start = c(2010, 1), end = c(2016, 9), frequency = 12)
plot(stl(UR, s.window = "periodic"))     # decompose into seasonal, trend, and remainder components