In this time series R guide, we are going to discuss and analyze monthly, per-cow milk production in pounds over the years 1962 to 1975. We will perform exploratory data analysis, build regression and ARIMA models, run model diagnostics, and use our model to forecast future values of cow milk production.
We will begin by reading in the Excel file of our data. We got our data from the Time Series Data Library, provided by Jonathon Cryer. We must first download a package that allows us to read in Excel files called readxl and then create a new variable that stores our data, which we will call “Cow.”
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
Cow <- read_excel("~/College/Junior/Stats/Stat 333/Cow.xls")
attach(Cow)
head(Cow)
## # A tibble: 6 x 2
## Month Prod
## <dttm> <dbl>
## 1 1962-01-01 00:00:00 589.
## 2 1962-02-01 00:00:00 561.
## 3 1962-03-01 00:00:00 640.
## 4 1962-04-01 00:00:00 656.
## 5 1962-05-01 00:00:00 727.
## 6 1962-06-01 00:00:00 697.
We will now plot our data to get an idea of what it looks like. We can also check our data to see if it has constant seasonal variance in this step. Constant seasonal variance is when the magnitude of the swings in our data is constant over time. If our data does not have constant seasonal variance, we will perform transformations to try to fix it. We simply plot our data with production as the response and the month as our predictor and specify that we want a line graph:
with(Cow, plot(Prod~Month, type = "l", xlab = "Month", ylab = "Milk Production", main = "Month vs. Milk Production"))
It looks like we have a strong increasing seasonal trend in our data. It also appears that we have pretty constant seasonal variance, meaning we have equal distances between peaks and valleys over the years. Therefore, we will not perform any transformations.
As always, we will also examine a plot of our residuals. We want constant variance in our residuals, meaning they are evenly spread above and below the center line and there is no visible trend in the data points. We can do this my creating a model and then using the plot function:
mod <- lm(Prod ~ Month)
plot(mod$residuals ~ mod$fitted.values)
abline(0,0)
We appear to have constant variance and no trend, so we will not perform any transformations. We will move forward with this model.
We can use the decompose function to view the trends of our data in different ways. Before we begin, we need to convert our data to a time series. We do this using the ts() function:
cowts <- ts(Prod, frequency = 12, start = c(1962, 1))
dec <- decompose(cowts)
plot(dec)
In the above graph, we can see that the decomposition shows our data in many different forms. Our first plot is of our observed data. The second is of the overall trend of our data, which appears to be increasing, as we saw before. The third plot is of our seasonal trend and the fourth is of our residuals.
It is common for time-ordered error terms to be autocorrelated. We have autocorrelation when our residuals are correlated through time. We assume that our residuals are independent, and if they are correlated, this condition is violated, and this can make our confidence intervals too wide or too narrow. Positive autocorrelation means that positive residuals are likely to be followed by positive residuals or that negative residuals are likely to be followed by negative residuals. Negative autocorrelation is when positive residuals are likely to be followed by negative residuals and vice versa.
We are looking at first order correlation because we are looking to see if \({\epsilon}_t\) is related to \({\epsilon}_{t-1}\) \({\epsilon}_{t+1}\). In this case, we will test for positive autocorrelation by using the Durbin Watson test. In this test, our null hypothesis that the error terms are not autocorrelated, and the alternative hypothesis is that the error terms have positive autocorrelation. When we test for positive autocorrelation, we will reject the null when our test stat is small.
plot(mod, which = 1)
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(mod)
##
## Durbin-Watson test
##
## data: mod
## DW = 0.5598, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
From our test, we have a p-value of about 0. This means that we reject the null in favor of our alternative hypothesis, so we do have evidence of positive autocorrelation. Because our model’s error terms are correlated, we will keep this in mind when assessing our model output.
We will now look at some smoothing techniques. We use smoothing when we want to see an overall trend, rather than the raw data. For example, for seasonal data, we might smooth out the seasonality so that we can identify the trend. We will perform four types of smoothing: basic moving average, weighted averages, normal kernel, and lowess.
When we use the weighted average smoothing method, we average each data point with the values on either side of it. We do this by using the filter() command to average each data point. This function uses three arguments: dataset, sides, and filter. We will set sides equal to 2, meaning we will average \({y_t}\) using \({y_{t-1}}\) and \({y_{t+1}}\), the immediate past and future values. We use the filter argument to evenly set the weight of the three data points used in the average. We can then plot both the smoothed and unsmoothed data to compare.
par(mfrow=c(2,1))
MA <- filter(Prod, sides=2, filter=rep(1/3,3))
plot.ts(MA, main="Moving Average Smoother", col = "maroon", lwd=2)
plot.ts(Prod, main="Unsmoothed Data")
The top graph shows our smoothed data using the weighted average method while the second graph is simply our raw data. As we can see, the smoothed data still shows the overall trend, just without so much noise.
To produce a more meaningful smoother, we can customize the weights of each data point. If the distance between the points we’re averaging are close, we use higher weights.
We will use the four data points on either side of each data point with weights 0.05, 0.05, 0.1, 0.15 as we move closer to our data point. We will create a vector of these weights and then use this as our weighted average filter.
wgt <- c(0.05, 0.05, 0.1, 0.15, 0.3, 0.15, 0.1, 0.05, 0.05)
weightcow <- filter(Prod, sides=2, filter=wgt)
We can now plot our weighted average smoother in purple against the original data:
plot(Prod, main="Weighted Average Smoother", type = "l")
lines(weightcow, lwd=2, col="purple")
As we can see, the purple line follows the trend of our data but makes it smoother.
In a dataset, points that are closer together could be more similar, so we use the normal kernel smoother method to decrease weights of points as they get further from \({y_t}\). The ksmooth() function does this for us:
plot(Prod, type = "l")
lines(ksmooth(time(Prod), Prod, "normal", bandwidth=2), lwd=2, col=4)
If we look at the blue line, we can see that it is more smoothed than our original plotted data, which is the black line. When we use the ksmooth function, we can specify how many surrounding points we want in our weighted average using the bandwidth argument. If we want to include more, we increase the bandwidth, as seen below:
plot(Prod, type = "l")
lines(ksmooth(time(Prod), Prod, "normal", bandwidth=4), lwd=2, col=4)
As we can see in the graph, increasing our bandwidth caused the line to become even smoother. This is because we averaged by taking into account the four surrounding data points rather than two.
Lastly, we will use the lowess smoothing method which uses nearest neighbors to smooth the trend line. The lowess smoothing method is used to see the relationship between time and cow milk production, in our case. It takes nearest neighbors and creates mini regressions from them to produce a trend line. In the R documentation, the lowess function is described as using “locally-weighted polynomial regression.”
plot(Prod, type = "l")
lines(lowess(Prod), lty=2, lwd=2, col=2)
In this case, we over smoothed a bit. This is because we used the default span. We can change this by adding a span specifier, f, and making it lower than the default:
plot(Prod, type = "l")
lines(lowess(Prod, f=.05), lwd=2, col=4)
Now that we changed our span, this looks better; we are not over smoothing anymore.
We will now create a model with monthly cow milk production as our response and time as our predictor. We will then run a t-test on our model to see if there is a significant relationship between our response and the predictor. We will use the summary() function to run the t-test; refer to our SLR R guide to learn more about t-tests.
Before we began, we numbered our months from 1 to 168 because we could not create a regression equation with the way the months were previously formatted. To do this, we created another column in our data called “Number.” We then read in this new excel file, as we did before.
library(readxl)
Coww <- read_excel("~/College/Junior/Stats/Stat 333/Coww.xls")
attach(Coww)
## The following objects are masked from Cow:
##
## Month, Prod
rmod2 <- lm(Prod ~ Number)
summary(rmod2)
##
## Call:
## lm(formula = Prod ~ Number)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.04 -50.02 -15.30 42.88 139.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 611.68235 9.41444 64.97 <2e-16 ***
## Number 1.69262 0.09663 17.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.74 on 166 degrees of freedom
## Multiple R-squared: 0.6489, Adjusted R-squared: 0.6468
## F-statistic: 306.8 on 1 and 166 DF, p-value: < 2.2e-16
We have a p-value of 2*10(-16), so we reject our null hypothesis. This means that we have significant evidence that there is a relationship between average monthly cow milk production and the number of month.
We will now use this model to predict and assess the model performance. In order to predict monthly cow milk production based on the month, we will use the predict function.
mydat <- data.frame(Number = 130)
predict(rmod2, mydat)
## 1
## 831.7223
In this example, we predicted for month number 130, which is October of 1972. Our model’s predicted value for this month’s milk production was 831.7223 pounds. The true value of cow milk production for that month is 810, so our model is a bit off.
We expect the model to be a little inaccurate because this model is only assessing the linear relationship between the two variables - it is not accounting for seasonality. So this model is predicting closer to what the average milk production will be at that time, rather than accounting for the peaks and valleys. This can be better explained by looking at our graph with the regression line included:
plot(Prod ~ Number, type = "l", xlab = "Month Number", ylab = "Milk Production", main = "Month Number vs. Milk Production")
abline(611.682, 1.693)
As we can see, this regression line disregards seasonality, so our predictions are closer to the line than the true values. So, this type of model is not ideal when we have models that include seasonality. To fix this, we will create and forecast with ARIMA models instead.
An ARIMA model is a combination of autoregressive (AR), integrated (I), and moving average (MA) models.
In an autoregressive model, we look at \(y_t\) as a linear combination of \(y_{(t-1)}\), …, \(y_{(t-p)}\). Autoregressive models are models of order p and have the following form:
\[{y_t} = {\phi_1}{y_{(t-1)}} + {\phi_2}{y_{(t-2)}} + ... + {\phi_p}{y_{(t-p)}} + {\epsilon_t}\]
We say that \({y_t}\) is stationary, \({\epsilon_t}\) is normal with mean zero and variance sigma squared, \({\phi_1} ... {\phi_p}\) are constants, and \({\phi_p}\) does not equal 0.
Next, in a moving average model, we say that \({y_t}\) is a linear combination of \({\epsilon_{t}, ..., {\epsilon_{t-q}}}\). The equation for this model is as follows:
\[{y_t} = {\mu} + {\epsilon_t} + {\theta_1}{\epsilon_{t-1}} + ... + {\theta_q}{\epsilon_{t-q}}\] In this equation, we assume that \({\epsilon_{t-q}}\) does not equal 0.
We will now combine the previous two models into an autoregressive moving average model. The equation for this ARMA model is as follows:
\[{y_t} = {\phi_1}{y_{t-1}} + ... + {\phi_p}{y_{t-p}} + {\epsilon_t} + {\mu} + {\theta_1}{\epsilon_{t-1}} + ... + {\theta_q}{\epsilon_{t-q}}\]
The above equation is the same as the full ARIMA equation with a value of d = D = 0. When this value is nonzero, the equation becomes more complicated, so we will keep it simple for now.
We will now construct an ARIMA model with a seasonal component because we have evidence of a seasonal trend in our data. The form for this type of model is ARIMA (p,d,q)(P,D,Q)S, where our lowercase p, d, and q, represent our nonseasonal data and capital P, D, and Q represent our seasonal data. S is the time span in a seasonal trend; for example, if we have S=12, this means we are looking at monthly data.
We must first download a few packages to be able to use our desired functions:
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: TTR
## Warning: package 'TTR' was built under R version 3.4.3
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
library(forecast)
library(xts)
attach(Cow)
## The following objects are masked from Coww:
##
## Month, Prod
## The following objects are masked from Cow (pos = 13):
##
## Month, Prod
We can now move forward with our model building.
We need to examine differenced data when we have seasonality. Seasonality usually causes the series to be nonstationary because average values for particular times within months may be different than average values within other months. For example, cow milk production will always be higher in summer months. Seasonal differencing removes seasonal trend, and we will begin by using a lag of 12 because our data is monthly.
First, we will create a difference variable using the diff() function, which uses our response variable and lag as arguments. In our example, our response variable is the number of pounds of milk each cow produces and our lag is 12 to begin with. We then plot these differences and run an adf.test with our difference variable as the argument.
When we run adf.test, we are testing to see if our data is stationary. Our null hypothesis is that the data is not stationary, and the alternative hypothesis is that the data is stationary. So we want a small p-value to be able to reject the null in favor of the alternative.
diff1 <- diff(Prod, lag = 12)
plot.ts(diff1)
adf.test(diff1)
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -2.8615, Lag order = 5, p-value = 0.2172
## alternative hypothesis: stationary
Looking at the plot and the output of our test, we have a p-value of .2172, which is higher than our significance level of .05. This tells us that we cannot reject the null hypothesis, meaning our data is not stationary and we still have a trend present. Therefore, we need non-seasonal differencing.
We will now run the same steps but with a lag of 1 and see if we can remove the trend:
diff2 <- diff(Prod, lag = 1)
plot.ts(diff2)
adf.test(diff2)
## Warning in adf.test(diff2): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff2
## Dickey-Fuller = -9.5083, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
We now have a p-value of .01, so we can reject the null and say that our data is stationary. Because we were successful in removing the trend with only performing first-order differencing for seasonal and nonseasonal components, we have d = D = 1.
To choose our values of p, q, P, and Q, we use the autocorrelation function (ACF) and partial autocorrelation function (PACF).
We will begin by using the acf() function to find our values of q and Q. We use our differenced data as our argument:
acf(diff2)
To determine our value of q, we look for spikes at low lags of our ACF graph. It looks like we have a spike at around 0, so we will say that q = 1. To determine our value of Q, we look for spikes at lags that are multiples of S in our ACF graph. Our S is equal to 12, so we look for spikes around values of 12 and 24. We have a spike at 12, so we say that Q = 1.
We will now use the pacf() function to determine our values of p and P. Again, we use our differenced data as our argument:
pacf(diff2)
To determine our value of p, we look for spikes at low lags of our PACF graph. We do not seem to have any spikes at low lags, so we will say that p = 0. To determine our value of P, we look for spikes at lags that are multiples of S in our PACF graph. Again, we look at values around 12 and 24. We do seem to have spikes around 12, so we will set our P = 1.
We now create our model with our determined p, q, d, P, Q, D and S values, as follows:
mod1 <- arima(cowts, order = c(0,1,1), season = list( order = c( 1,1,1), period=12))
We will now look at more specifics of the model that we have created. To run diagnostics on our models, we want to analyze the residuals. We can do this using a hypothesis test that uses the Ljung Box statistic.
If our ARIMA model is doing a good job, it should be accounting for the dependency of \(y_t\)s and the residuals should be small. Our Ljung Box Statistic will also then be small.
In this test, our null hypothesis is that our model is adequate. Our alternative hypothesis is that the model is inadequate (meaning our residuals are too big which creates a large test statistic).
If we reject the null hypothesis, we go back to the drawing board to find better values of p, d, q, P, D, or Q.
We first run a summary of our model to see the coefficients of each of our p, d, q, P, D, and Q values. We then use the Box.test function with our residuals as the argument and specify the type of test as “Ljung.” We will also run the AIC of our model for future comparative purposes.
summary(mod1)
##
## Call:
## arima(x = cowts, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ma1 sar1 sma1
## -0.2218 -0.0617 -0.5868
## s.e. 0.0746 0.1183 0.0948
##
## sigma^2 estimated as 52.58: log likelihood = -530.02, aic = 1068.03
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.03150701 6.966739 5.140968 -0.004158041 0.6862361 0.1317995
## ACF1
## Training set -0.008603438
Box.test(mod1$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.012659, df = 1, p-value = 0.9104
AIC(mod1)
## [1] 1068.034
Because our p-value is larger than .05, so do not reject the null hypothesis; this means that our model is adequate. However, it looks like our sar1 value is not significantly different from 0, so we will try creating another model with P = 0. The AIC value of this model is 1068.034, so we will see if setting P = 0 causes this to decrease.
mod2 <- arima(cowts, order = c(0,1,1), season = list( order = c( 0,1,1), period=12))
summary(mod2)
##
## Call:
## arima(x = cowts, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.2204 -0.6214
## s.e. 0.0748 0.0627
##
## sigma^2 estimated as 52.7: log likelihood = -530.15, aic = 1066.3
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0356593 6.975173 5.168896 -0.003681841 0.6897827 0.1325155
## ACF1
## Training set -0.00843703
Box.test(mod2$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod2$residuals
## X-squared = 0.012174, df = 1, p-value = 0.9121
AIC(mod2)
## [1] 1066.301
Our p-value has increased and our AIC value has decreased to 1066.301, so it looks like this is the better model.
Out of curiosity, we will compare our results with the results of an automated process. We create a model using the auto.arima function with our time series data as the argument. We can then print the results of the model:
automod <- auto.arima(cowts)
automod
## Series: cowts
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.2204 -0.6214
## s.e. 0.0748 0.0627
##
## sigma^2 estimated as 53.42: log likelihood=-530.15
## AIC=1066.3 AICc=1066.46 BIC=1075.43
Our model is very close to this automated model. We can look at the AIC of both models to see which is better and run another Box Ljung test to make sure our model is adequate:
AIC(automod); AIC(mod2)
## [1] 1066.301
## [1] 1066.301
Box.test(automod$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: automod$residuals
## X-squared = 0.012174, df = 1, p-value = 0.9121
Our model and the automated model have the same AIC value, so they are equally accurate. The p-value for Box Ljung test on the automated model is .9121, so we accept our null hypothesis that the model is adequate.
We can now forecast using both our model and the model that auto.arima created. We will expect these forecasts to be very similar since the models were so similar.
par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod2, h = 12), main = "Our Model")
We can also forecast the pounds of milk production for a specific number of months beyond where our dataset ends. We use the forecast() function with our model as the argument and then set a variable h as the number of months into the future we want to forecast. We will forecast 12 months beyond our dataset:
forecast(mod2, h = 12)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1976 864.9773 855.6736 874.2809 850.7485 879.2060
## Feb 1976 817.7493 805.9522 829.5464 799.7071 835.7914
## Mar 1976 924.4056 910.5569 938.2543 903.2259 945.5853
## Apr 1976 937.4836 921.8503 953.1169 913.5746 961.3927
## May 1976 1000.6235 983.3894 1017.8576 974.2662 1026.9807
## Jun 1976 973.2165 954.5181 991.9148 944.6198 1001.8131
## Jul 1976 931.8501 911.7941 951.9060 901.1772 962.5230
## Aug 1976 892.2597 870.9324 913.5871 859.6424 924.8771
## Sep 1976 846.3679 823.8408 868.8949 811.9157 880.8201
## Oct 1976 851.5326 827.8665 875.1987 815.3385 887.7267
## Nov 1976 817.4931 792.7404 842.2458 779.6371 855.3491
## Dec 1976 859.7534 833.9598 885.5470 820.3055 899.2013
This gives us point estimates and 80 and 95 percent confidence intervals for the average pounds of milk produced per cow in each month. For example, this estimates that in December of 1976, each cow would produce 859.7534 pounds of milk on average. If we wanted to use these estimates for a purpose and therefore wanted to have a bit more room for error in our estimates, we could say that we are 80% confident that each cow will produce between 833.9598 and 885.547 pounds of milk in December of 1976 on average. If we wanted to play it even safer, we could say that we are 95% confident that each cow will produce between 820.3055 and 899.2013 pounds of milk in December of 1976 on average.
From the USDA’s website, we found that the average milk production per cow was actually 875 pounds in December of 1976, so our prediction intervals were pretty accurate. This is clearly a better method of prediction than the method we used based of off a linear model.
That’s it; we’re done!