1 General directions for this Workshop

You will work in RStudio. Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop.

At the beginning of the R Notebook write Workshop 2 - Financial Econometrics II and your name (as we did in previous workshop).

Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop. More specifically, you have to:

  • Replicate all the R Code along with its output.

  • You have to do whatever is asked in the workshop. It can be: a) Responses to specific questions and/or do an exercise/challenge.

Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook. Your own workshop/notebook will be very helpful for your further study.

You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file.

2 Introduction to ARIMA-SARIMA Models

A seasonal ARIMA model, also called ARIMA/SARIMA model, is usually applied to monthly or quarterly time series. This model is very effective for any of the following type of economic/financial series:

  1. Historical monthly or quarterly sales data

  2. Historical monthly or quarterly price data

  3. Historical monthly or quarterly data related to any variable of any financial statement. For example: cost of good sold, sales and general administrative expenses, income tax, EBIT, etc.

  4. Historical monthly or quarterly data related to any macro economic variable. For example: inflation indexes, GDP indexes, exchange rates, etc.

These models are not so effective for daily stock prices or financial indexes. The main reason is that one of the most important conditions for a financial market to exist is that no body can predict stock prices since the log of stock prices are supposed to behave as a Random Walk! However, these models are useful in this context compared to just doing intuitive guesses.

3 Calibration Steps for ARIMA-SARIMA

Each ARIMA-SARIMA model needs to be CALIBRATED. The general steps to calibrate a model are the following:

  1. Check if the series has seasonality. If the series shows a seasonal pattern, start with the seasonal difference of the log of the series, which is the % annual growth of the series. Usually monthly and quarterly data of many economic and financial series have seasonal pattern.

I recommend that if you have monthly or quarterly data of any financial or economic variable, start testing whether the seasonal difference of the log of the series is stationary. If you have daily data, it is hard to model for seasonality. In the case of daily data, start testing whether the first difference of the log is stationary.

In this step you start with a version of the series.

For next steps we will assume that we decided to use the seasonal difference of the log of the series.

  1. Test whether the seasonal difference is stationary. If so, continue to next step. If not, then change the variable to the first difference of the log of the series, which is the % change for each period. Most of the economic and financial series become stationary with the first log difference.

  2. Run the ACF and PACF plots to identify the ARIMA-SARIMA parameters p, d, q, P, D, Q. The ACF is the auto-correlation plot, and the PACF is the partial autocorrelation plot. Both plots show auto*correlations between the variable and its own lags.

An ARIMA/SARIMA model has the following “parameters” that need to be defined:

arima(p,d,q) sarima(P,D,Q,#Periods)

  • p: refers to the number of autoregressive (AR) terms. Usually this parameter is either 0,1 or 2.

  • d: refers to how many first differences where needed to the series in order to make the series stationary. Usually this parameter is either 0 or 1.

  • q: refers to the number of moving average (MA) terms in the model. Usually this parameter is either 0,1 or 2.

  • P: refers to the number of SEASONAL autoregressive terms. Usually this parameter is either 0 or 1.

  • D: refers to how many SEASONAL differences were needed to the series to make the series stationary. Usually this parameter is either 0 or 1.

  • Q: refers to the number of SEASONAL moving average terms. Usually this parameter is either 0 or 1.

  • #periods: refers to the number of periods in the year. For example, if the data is monthly, then #periods=12.

These parameters are usually expressed as: ARIMA(p,d,q) SARIMA(P,D,Q, #periods).

When you identify the first values of these paremeters, then you will have your first CALIBRATION of the model.

  1. Estimate/Run the ARIMA-SARIMA model

  2. Run the ACF and PACF of the residuals/errors of the model to check whether the errors is a white noise series. In other words, if there is no significant autocorrelations of the errors.

If there is one or more significant autocorrelations, we can include other term(s) in the ARIMA-SARIMA model, and go back to step 4.

If there the errors seem like a white noise (no signficant autocorrelations), then we can continue and finish the calibration process.

  1. Interpret the model

  2. Run a forecast using the model

Let’s start practicing with these models.

4 Example - model for air passengers

4.1 Identifying seasonality in a time-series

We will work with a simple dataset that has the number of air passengers of an US Airline. This series might have a strong seasonality since people used to fly in vacation periods, which is the the same each year.

Download the excel file from a web site:

download.file("http://www.apradie.com/ec2004/air2.xlsx", "air2.xlsx", mode="wb")

In order to read Excel files, we need the readxl package:

library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5

Now, we read the data from the Excel file:

dataset <- read_excel("air2.xlsx")
# I familiarize with the data
head(dataset)
## # A tibble: 6 x 4
##     air  time     t month              
##   <dbl> <dbl> <dbl> <dttm>             
## 1   112 1949      1 1949-01-01 00:00:00
## 2   118 1949.     2 1949-02-01 00:00:00
## 3   132 1949.     3 1949-03-01 00:00:00
## 4   129 1949.     4 1949-04-01 00:00:00
## 5   121 1949.     5 1949-05-01 00:00:00
## 6   135 1949.     6 1949-06-01 00:00:00

In order to perform time-series analysis, it is better to use xts objects. We can construct this by using the xts() method of the xts package. Objects of class xts are composed by 2 elements: the core data and the index (class Date, POSIX,etc).

# Load library xts and zoo. Install the packages if you haven't already.
library(xts)
library(zoo)
library(tseries)
library(forecast)


# I get the date of the dataset:
date <- dataset$month
# I join both objects using xts() and assign it to a new object called air.xts
air.xts<- xts(x=dataset$air, frequency = "monthly", order.by=date)
names(air.xts)<-c("Passengers")
class(air.xts)
## [1] "xts" "zoo"
# I see behaviour of air passengers over time 
plot(air.xts$Passengers)

This dataset contains air passengers by month from an airline. Do the following:

Does the series look stationary? Obviously not, it looks with a clear growing trend, and a seasonality pattern. The mean of the series GROWS OVER TIME.

Get the natural logarithm of the series:

lnair.xts = log(air.xts)

4.2 Testing for stationary

Since we see a clear seasonality pattern, it is strongly recommended to use the seasonal difference of the log of the series. So, we will check whether we can treat the seasonal difference of the log as STATIONARY.

We start by plotting the seasonal difference of the log, which is the annual % growth (in cc) of the air passengers:

plot(diff(lnair.xts,lag=12))

It is hard to see whether this is stationary or not, so we have to run the Dicky-Fuller test. Run the Dicky-Fuller test using the adf.test function of the tseries packages.

adf.test(na.omit(diff(lnair.xts,lag=12)),alternative="stationary",k=0)
## Warning in adf.test(na.omit(diff(lnair.xts, lag = 12)), alternative =
## "stationary", : p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(diff(lnair.xts, lag = 12))
## Dickey-Fuller = -4.988, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

We got a p-value<0.05, so we can treat the seasonal difference (the annual % growth) as stationary and continue with this series to calibrate an ARIMA-SARIMA model.

4.3 ACF and PACF plots

Once we confirm that our series is stationary, then the first step is to do the AC and PAC plots to identify how many AR and MA terms we can include in our ARIMA/SARIMA model.

Remember that we can compute the autocorrelation plots with the acf2 function from the astsa package. Let’s examine the autocorrelations of our series. Remember that our series is the seasonal difference of the log of air passengers, which is the annual % growth of air passengers month by month.

library(astsa)
# I create a ts object from the xts; the acf2 works better with ts objects
lnair =ts(lnair.xts$Passengers)
acf2(diff(lnair,lag=12))

##      [,1] [,2]  [,3] [,4] [,5]  [,6]  [,7] [,8]  [,9] [,10] [,11] [,12] [,13]
## ACF  0.71 0.62  0.48 0.44 0.39  0.32  0.24 0.19  0.15 -0.01 -0.11 -0.24 -0.14
## PACF 0.71 0.23 -0.06 0.10 0.05 -0.06 -0.06 0.01 -0.01 -0.29 -0.16 -0.14  0.29
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## ACF  -0.14 -0.10 -0.15 -0.10 -0.11 -0.14 -0.16 -0.11 -0.08
## PACF  0.06  0.05 -0.04  0.13 -0.06 -0.15 -0.02  0.12 -0.18

I used the ts function to transform the dataset into a ts dataset since this type of R object will allow us to easily run and forecast arima-sarima models.

4.3.1 ACF and PACF interpretation

The ACF shows the autocorrelations (AC) between the series and its lags, and the PACF shows the partial autocorrelations (PAC) between the series and its lags.

The VERTICAL lines show the MAGNITUD of the AUTOCORRELATION between a LAG and the current value of the series.

The blue horizontal dotted lines in the plots cover the 95% confidence interval for the autocorrelations. So, if the vertical line croses the blue dotted line, it means that that specific autocorrelation is SIGNIFICANTLY DIFFERENT THAN ZERO (it could be positive or negative).

Now what is the general interpretation of the these plots?

First, we always have to remember what we are modelling. In this case we are modelling the seasonal difference of the log, which is the ANNUAL % GROWTH of passengers month by month.

In the ACF plot I see that the first 8 ACs are positive and significant, and they gradually decay. For example, the AC of LAG 1 is about 70% (almost 0.70). This means that ANNUAL % growth of air passengers is strongly correlated (with a correlation of 70%) with ITS OWN ANNUAL % GROWTH of the PREVIOUS month (its LAG 1).

The AC of LAG 2 is about 62%, so the ANNUAL % GROWTH of passengers is strongly correlated with ITS OWN ANNUAL % GROWTH of 2 MONTHS AGO (its LAG 2)… and so on.

It is important to mention that the ACF plot shows autocorrelations independently of each other.

In the case of PACF, we see PARTIAL autocorrelations, which measure HOW MUCH ELSE the series is correlated with each its own LAGS, AFTER CONSIDERING the effect of LOWER-ORDER LAG AUTOCORRELATIONS.

I see that the PACF shows only the first 2 autocorrelations to be positive AND SINGIFICANT, and the magnitude of the following autocorrelations goes down to zero or negative very quickly.

When the ACF plot shows a SLOW DECLINE of autocorrelations, and the PACF shows a FAST DECLINE of autocorrelatinos, this is a clear PATTERN OF AN AR(p) MODEL. In this case, the # of AR terms is determined by the SIGNIFICANT LAGS in the PACF plot.

We see that also the AC of LAG 10 is statistically significant and negative, but it is recommended first to start with the AR(2) model, and then check the residuals of the model to see whether there is another significant lag.

Then, in this case we start with an ARIMA with p=2 and q=0.

Read Dr. Nau slides to review more PATTERNS of ACF and PACF plots.

4.4 Estimate the ARIMA-SARIMA model

In this case, I can start with the following calibration. Now we can use the log series, and leave the model to automatically calculate the SEASONAL difference. Then, I will use the log of the series (log of air passengers) and define my model as:

arima(p,d,q) sarima(P,D,Q,#periods in the year)

In our example, the values for the parameters are:

arima(2,0,0) sarima(0,1,0,12)

What this model tells us?

  1. You first have to look at the # of periods. In this case, you have 12 periods in a year, so your series should be monthly.

  2. Then you look at d and D (first differences and the seasonal differences). In this case, d=0 and D=1. What does this mean? this means that we are modeling the SEASONAL difference (D=1) of the log of the variable, which is the annual % growth (month by month). Since d=0 this means that we are NOT using the first difference of the series.

  3. Then you look at the p, q, P, Q. p tells you how many AR terms are included in your model. In this case, p=2, so your model includes the first 2 AR terms. Since q, P and Q are zero, then your model does not include any of these type of terms.

How this model can be expressed mathematically?

Since you are modeling the seasonal difference, you can express this model as the following equation:

\(\triangle^{12}lnair=\phi_{0}+\phi_{1}(L^{1}\triangle^{12}lnair_{t})+\phi_{2}(L^{2}\triangle^{12}lnair_{t})+\varepsilon_{t}\)

Using a simpler notation, this equation can be expressed as:

\(s12.lnair_{t}=\phi_{0}+\phi_{1}L1.s12.lnair_{t}+\phi_{2}L2.s12.lnair_{t}+\varepsilon_{t}\)

Remember that s12.lnair is the seasonal difference of the log of air passengers, which is the annual %growth of air passengers month by month. Then we can read this mathematical expression as: “The annual %growth (month by month) of air passengers at time t can be determined by its own annual % growth of air passengers at time t-1 and t-2 (the %growth of the previous 2 months), and by a random shock”

4.5 Running the ARIMA/SARIMA

Now we can estimate the ARIMA-SARIMA model using the Arima function of the forecast package.

In this case, instead of creating a variable for the seasonal difference of the log (s12.lnair), we will configure the parameters of the Arima function so that the log transformation and the seasonal difference will be calculated before estimating the arima coefficients.

In our first calibration we discussed above, we defined the following values of the ARIMA-SARIMA model for the log of the number of air passengers:

arima(2,0,0) sarima(0,1,0,12)

This means that D=1, so R will calculate the seasonal difference, and then estimate the arima model and include 2 AR terms and no MA terms.

We can run this model in R as follows:

m1 <- Arima(air.xts$Passengers, # We use the Passengers column of the dataset
        order = c(2,0,0), # we indicate that p=0; d=0; q=0 
        seasonal = list(order=c(0,1,0),period=12), #P=0; D=1; Q=0, #periods=12
        include.constant = TRUE, # Here we indicate to include the phi0 coefficient
        lambda = 0) # Here we indicate to apply the natural log to the series  
m1
## Series: air.xts$Passengers 
## ARIMA(2,0,0)(0,1,0)[12] with drift 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##          ar1     ar2   drift
##       0.5540  0.2378  0.0096
## s.e.  0.0845  0.0848  0.0014
## 
## sigma^2 estimated as 0.001742:  log likelihood=233.13
## AIC=-458.26   AICc=-457.95   BIC=-446.73

If we set lamda = 0 this means that the the model will use the log of the variable (passengers); this is one of the Box-Jenkins mathematical transformations, and it is the most common for financial and economic series.

We can alternatively use the function sarima from the astsa model to run the same arima model. The sarima function provides more detail in the output, but the Arima function has advantages when we want to get the forecast with the model. We run sarima for the same model as:

m1a<-sarima(log(air.xts$Passengers), p=2, d=0, q=0,P=0,D=1,Q=0,S=12)
## initial  value -2.794390 
## iter   2 value -2.940651
## iter   3 value -3.175869
## iter   4 value -3.186870
## iter   5 value -3.189461
## iter   6 value -3.189605
## iter   7 value -3.189606
## iter   8 value -3.189607
## iter   9 value -3.189607
## iter  10 value -3.189608
## iter  11 value -3.189609
## iter  12 value -3.189609
## iter  12 value -3.189609
## iter  12 value -3.189609
## final  value -3.189609 
## converged
## initial  value -3.184486 
## iter   2 value -3.184596
## iter   3 value -3.184959
## iter   4 value -3.185009
## iter   5 value -3.185059
## iter   6 value -3.185073
## iter   7 value -3.185076
## iter   8 value -3.185076
## iter   8 value -3.185076
## final  value -3.185076 
## converged

m1a
## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1     ar2  constant
##       0.5540  0.2378    0.0096
## s.e.  0.0845  0.0848    0.0014
## 
## sigma^2 estimated as 0.001701:  log likelihood = 233.13,  aic = -458.26
## 
## $degrees_of_freedom
## [1] 129
## 
## $ttable
##          Estimate     SE t.value p.value
## ar1        0.5540 0.0845  6.5597  0.0000
## ar2        0.2378 0.0848  2.8037  0.0058
## constant   0.0096 0.0014  6.8655  0.0000
## 
## $AIC
## [1] -3.204618
## 
## $AICc
## [1] -3.203411
## 
## $BIC
## [1] -3.12398

As we see, we need to take the log first and then specify the model. The result is exactly the same as the Arima result, but with sarima we can see the coefficients, their p-values and other important plots of the model. Also, we can see important plots such as ACF of the errors.

4.6 Interpretation of an ARIMA-SARIMA model

Since the Arima function does not display the p values and t values of the coefficients, we can use the function coeftest from the lmtest library.

You have to install the lmtest library and the load it:

library(lmtest)

Now we can see the coefficients and their corresponding standard error, pvalue and t value. We apply the coeftest function to the m1 model created with Arima:

coeftest(m1)
## 
## z test of coefficients:
## 
##        Estimate Std. Error z value  Pr(>|z|)    
## ar1   0.5540129  0.0844572  6.5597 5.392e-11 ***
## ar2   0.2377939  0.0848135  2.8037  0.005052 ** 
## drift 0.0095854  0.0013962  6.8655 6.625e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The drift coefficient is the phi0 coefficient. phi0 and both lags are positive and SIGNIFICANT. What does this mean? First, I have to remember which variable I am modelling. I am modeling the seasonal difference of log passengers, which is the annual % growth of air passengers month by month. Then, I can interpret this ARIMA-SARIMA model as follows:

The annual % growth of air passengers is positively and significantly related with its annual % growth of air passengers of last month (lag1). In a similar way, the annual % growth of air passengers is positively and significantly related with the annual % growth of two months ago. More specifically, 55.4% of annual growth of last month is passed to this month’s annual % growth, and only 23.78% of the annual growth of 2 months ago is passed to the current annual % growth.

In other words, for each percentual point of annual growth of previous month, it is expected that the current annual growth will grow about 0.55 percent points

It is interesting to see that the sum of phi1 and phi2 (0.55+0.23) MUST be < 1 in order for the series to be stationary. If you include more arima terms, the sum of the phi’s must be less than one. If you include MA terms, the sum of the theta coefficients also must be less than one. In addition, the constant or phi0 is positive and significant, indicating that the annual growth of air passengers has a positive tendency to grow over time.

The mathematical model is: \[ E[s12.lnair] = 0.0096 + 0.554*L1.s12.lnair + 0.2378*L2.s12.lnair \] or \[ s12.lnair = 0.0096 + 0.554*L1.s12.lnair + 0.2378*L2.s12.lnair + error \]

4.7 Autocorrelations of the Model residuals/errors

We need to generate a series for the residuals of the ARIMA-SARIMA model. To do this, get the residuals as follows:

air_res <- m1$residuals
head(air_res)
## Time Series:
## Start = 1 
## End = 432001 
## Frequency = 1.15740740740741e-05 
## [1] 0.004708908 0.004751501 0.004854026 0.004821445 0.004747833 0.004847728

4.8 Test whether the residuals is a white noise series

Test whether the residual series (the air_res variable) is a white noise running the ac and pac plots. A white noise is a series that has no signicant autocorrelations with any of its own lags.

acf2(air_res)

##      [,1] [,2]  [,3] [,4] [,5] [,6]  [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.01 0.01 -0.13 0.02 0.09 0.05 -0.03 0.05 0.20 -0.01 -0.09 -0.40  0.04
## PACF 0.01 0.01 -0.13 0.02 0.10 0.03 -0.03 0.08 0.21 -0.04 -0.10 -0.37  0.01
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## ACF   0.01  0.11 -0.11  0.07  0.04 -0.05 -0.13 -0.03 -0.03
## PACF -0.04  0.01 -0.09  0.16  0.11 -0.08 -0.08  0.16 -0.09

INTERPRET these plots

Can you add another AR or MA term to improve the previous model and then to make the residual a white noise ?

4.9 Forecast air passengers

We re-run the Arima model we run above, but now we will transform the dataset from xts to a ts R object to improve the visualization of the forecast:

#Convert the xts to a ts dataset:
air.ts <- ts(coredata(air.xts$Passengers),start=c(1949,1),frequency=12)
# We display the content of the ts dataset:
air.ts
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
# Run the Arima model:
m1 <- Arima(air.ts, order = c(2,0,0), # p=2:include 2 AR terms; q=0: No MA terms;
       seasonal = list(order=c(0,1,0), # D=1: use seasonal difference; No Seasonal terms
                       period=12),  # periods=12 period in the year
        include.constant = TRUE, # include the phi0 (drift) coefficient
        lambda = 0) # do the log transformation
m1
## Series: air.ts 
## ARIMA(2,0,0)(0,1,0)[12] with drift 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##          ar1     ar2   drift
##       0.5540  0.2378  0.0096
## s.e.  0.0845  0.0848  0.0014
## 
## sigma^2 estimated as 0.001742:  log likelihood=233.13
## AIC=-458.26   AICc=-457.95   BIC=-446.73
# We can see the pvalues of each ARIMA-SARIMA coefficient using the
#   coeftest from the lmtest library:
coeftest(m1)
## 
## z test of coefficients:
## 
##        Estimate Std. Error z value  Pr(>|z|)    
## ar1   0.5540129  0.0844572  6.5597 5.392e-11 ***
## ar2   0.2377939  0.0848135  2.8037  0.005052 ** 
## drift 0.0095854  0.0013962  6.8655 6.625e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The pvalue of the drift, \(\phi_0\), and the pvalues of the AR coefficients, \(\phi_1\) and \(\phi_2\), are very significant (<0.05). Then we keep these terms in the model.

Using the last Arima model, run a forecast for the following 24 months.

We run the forescast using the arima model m1 (the one I run with the Arima function. We indicate to do a forecast for 24 months, specifying h=24:

forecast_air <- forecast(m1, h=24)
# I stored the information of the forecast in forecast_air

# I plot the forecast
autoplot(forecast_air)

The forecast_air R object has both, historical data and forecast data. A very nice feature of the forecast function after using Arima function, is that we DO NOT HAVE TO do transformations to get the original scale of the variable, which is in tons. The Arima function automatically does the log transformation, and the forecast function automatically does the exponential transofrmation to get the forecast in tons!

The forecast_air R object is a list type of object that contains several elements.

attributes(forecast_air)
## $names
##  [1] "method"    "model"     "level"     "mean"      "lower"     "upper"    
##  [7] "x"         "series"    "fitted"    "residuals"
## 
## $class
## [1] "forecast"

The forecast values are in the mean attribute:

forecast_air$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1961 450.5662 424.4918 457.4920 505.5161 519.5149 590.6793 688.5216 672.2711
## 1962 503.1498 474.3893 511.5911 565.5928 581.5127 661.4150 771.2137 753.2077
##           Sep      Oct      Nov      Dec
## 1961 564.5825 513.1307 434.6578 481.9836
## 1962 632.6921 575.1384 487.2573 540.3795

forecast_air also has historical data (x), residuals, the lower and upper limit of the 95% confidence interval of the forecast, etc !

5 Q CHALLENGE - Modelling consumer sales with an ARIMA/SARIMA model

Go to the course site (Material) and download Dr. Nau slides about ARIMA/SARIMA models. Read the slides and focus on his recommendations to select the number of p, d, q, P, D, Q for any ARIMA/SARIMA model. Download the following Dataset:

download.file("http://www.apradie.com/ec2004/salesfab2.xlsx", "salesfabs2.xlsx", mode="wb")

Now, we read the data from the Excel file and construct an xts dataset:

sales_brands <- read_xlsx("salesfabs2.xlsx", sheet = "Sheet1")

#I create an xts object to handle the object and dates in a better way:

# I construct a date variable as a sequence of dates using the month column:
date <- seq(as.Date(sales_brands$month[1]),
                      by="month",
                      length.out = nrow(sales_brands) )
# I delete the first column since it has the month, and I will construct
#   an xts dataset with the index equal to the month
sales_brands = sales_brands[,-1]

# I create the xts object:
sales_brands<- xts(x=sales_brands, frequency = "monthly", order.by=date)

This dataset contains consumer national monthly sales of a Category of products. The sales is in tons (qfab#) and prices (pfab#) and include sales of few products. You have to do the following:

Generate the log variables for price (pfab1) and sales in volume (qfab1) for product 1.

Now you have to work in a model for qfab1, sales in tons of product 1:

  1. Graph the natural log of sales in tons of this product. WHAT DO YOU OBSERVE about the series? does it look stationary or nonstationary? does it look with seasonality?

  2. If the series is not stationary, try whether the seasonal difference of the log looks as stationary (s12.lnqfab1). Do a time series plot, run and INTERPRET the Dicky-Fuller test

  3. According to Dr Nau slides and what you learned in this workshop, calibrate the best ARIMA/SARIMA model.

  4. What is a white noise series? RESPOND WITH YOUR WORDS

  5. Create a variable for the residuals of your model. Run the ACF plot and the PACF plot for this residual. Does the residual look like a white noise?

  6. Why do you think that the residuals of your best model should look like A White noise? RESPOND WITH YOUR OWN WORDS

  7. Create 24 new observations in the future

  8. Create a forecast for the future observations using your last model

  9. (optional) Estimate the Akaike Information Criterion (AIC) of the model. To do this, run the following command right after running your arima model:

Do a quick research about what is the AIC of the model and EXPLAIN WITH YOUR OWN WORDS

6 Reading

Go to the course site (Material) and download Dr. Nau slides about ARIMA/SARIMA models.

7 Quiz 4 and W4 submission

Go to Canvas and respond Quiz 4. You will have 3 attempts. Questions in this Quiz are related to concepts of the readings related to this Workshop.

The grade of this Workshop will be the following:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions
  • Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.
  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.
  • Not submitted (0%)

Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.