Homework 6 Predictive analytics

Problem 8.2

A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

A stationary time series will look the same at any time interval; while it can have cyclic behavior, there should not be seasonality or a trend.

The ACF plot shows strong autocorrelation at all lags, decaying only slowly, which is typical of a non-stationary series. The PACF plot, however, shows a strong autocorrelation only at lag 1, because it removes the effects of the intervening lags.

The following time series plot shows:

- A sharp drop in the daily closing IBM stock price, indicating a downward trend
- Higher variability between time 200 and time 300 (can be addressed with a log transformation)

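After a log transformation and a first difference, most of the autocorrelations fall within the significance bounds. Additionally, the portmanteau (Box-Ljung) test produces a high p-value (0.1316), and the KPSS unit root test statistic (0.3932) is below the 5% critical value of 0.463, so stationarity is not rejected at that level.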

As illustrated in the residual plot, there is an increasing trend in the residuals over time.
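The checks described above can be sketched in base R. The example below uses a simulated random walk as a stand-in for ibmclose (an assumption, so that the snippet runs without the fpp2 package); with fpp2 loaded, `ibmclose` could be substituted for `y`. The lag of 10 is chosen to match the df reported in the Box-Ljung output.

```r
# Stand-in for a non-stationary price series: a simulated random walk.
# (Assumption: replace `y` with ibmclose when the fpp2 package is available.)
set.seed(42)
y <- ts(450 + cumsum(rnorm(369)))

# ACF of the raw series: for a random walk it decays very slowly.
acf_raw <- acf(y, lag.max = 20, plot = FALSE)

# PACF of the raw series: dominated by a single large spike at lag 1.
pacf_raw <- pacf(y, lag.max = 20, plot = FALSE)

# Log transform, then first difference.
y_diff <- diff(log(y), lag = 1)

# Portmanteau (Ljung-Box) test on the differenced series.
lb <- Box.test(y_diff, lag = 10, type = "Ljung-Box")

print(round(acf_raw$acf[2:6], 3))  # autocorrelations at lags 1 to 5
print(round(pacf_raw$acf[1], 3))   # partial autocorrelation at lag 1
print(lb)
```

A KPSS test is not part of base R; the output in this write-up has the layout printed by `summary(urca::ur.kpss(y_diff))`, which could be appended if the urca package is installed.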

#> 
#>  Box-Ljung test
#> 
#> data:  ibmclose %>% log() %>% diff(1)
#> X-squared = 15.014, df = 10, p-value = 0.1316

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 5 lags. 
#> 
#> Value of test-statistic is: 0.3932 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

Problem 8.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

usnetelec

The plot for usnetelec shows an upward trend, which must be addressed to obtain stationary data.

The ACF and PACF plots tell us that most of the autocorrelation comes from the previous time series value. We can difference at lag = 1 to correct this.

#> 
#>  Box-Ljung test
#> 
#> data:  usnetelec %>% sqrt() %>% diff(1)
#> X-squared = 8.2901, df = 10, p-value = 0.6005

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 3 lags. 
#> 
#> Value of test-statistic is: 0.4656 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

usgdp

The plot for usgdp shows an upward trend, which must be addressed to obtain stationary data. There does not seem to be significant seasonality.

We see that taking a difference at lag 1 and performing a BoxCox transform does not address all the autocorrelation, since:

- Some of the autocorrelations fall outside the interval (lag = 1, lag = 12)
- The portmanteau test produces a significant p-value
- However, the unit root test has a test statistic within the range for stationary data

#> 
#>  Box-Ljung test
#> 
#> data:  usgdp %>% BoxCox(BoxCox.lambda(usgdp)) %>% diff(1)
#> X-squared = 65.525, df = 24, p-value = 1.019e-05

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 4 lags. 
#> 
#> Value of test-statistic is: 0.2013 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

mcopper

The plot for mcopper shows an upward trend, which must be addressed to obtain stationary data. There does not seem to be significant seasonality. There is some changing variability that can be addressed with a BoxCox transform.

We see that taking a difference at lag 1 and performing a log transform does make the time series stationary (but still auto-correlated), since:

- Some of the autocorrelations fall outside the interval (lag = 1 particularly)
- The portmanteau test produces a significant p-value
- However, the unit root test has a test statistic within the range for stationary data

#> 
#>  Box-Ljung test
#> 
#> data:  mcopper %>% log() %>% diff(1)
#> X-squared = 92.239, df = 24, p-value = 6.121e-10

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 6 lags. 
#> 
#> Value of test-statistic is: 0.0425 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

enplanements

The time series plot for enplanements shows strong seasonality, an upwards trend, and changing variability.

#> 
#>  Box-Ljung test
#> 
#> data:  enplanements %>% BoxCox(BoxCox.lambda(enplanements)) %>% diff(12) %>% diff(1)
#> X-squared = 71.176, df = 24, p-value = 1.449e-06

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 5 lags. 
#> 
#> Value of test-statistic is: 0.0424 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

visitors

The time series plot for visitors shows strong seasonality, an upwards trend, and changing variability.

#> 
#>  Box-Ljung test
#> 
#> data:  visitors %>% BoxCox(BoxCox.lambda(visitors)) %>% diff(12) %>% diff(1)
#> X-squared = 121.61, df = 24, p-value = 4.996e-15

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 4 lags. 
#> 
#> Value of test-statistic is: 0.0158 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

Problem 8.5

For your retail data (from Exercise 3 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

Taking a look at this plot, a BoxCox transform will be needed because of the changing variance. Seasonal differencing may also be needed, because this data is highly seasonal. The upwards trend will also require differencing to address.

The season plot shows annual seasonality.

Stationary data for this time series can be obtained by applying a BoxCox transform, seasonal differencing at lag = 12, and a first difference at lag = 1. This produces a unit root test statistic within the range for stationary data.
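The sequence of transform, seasonal difference, and first difference can be sketched in base R. The retail series itself is not reproduced here, so the built-in monthly `AirPassengers` data serves as a stand-in (an assumption), and `box_cox` is a hypothetical helper written for this illustration rather than a function from the forecast package.

```r
# Box-Cox transform; lambda = 0 corresponds to the log transform.
# (Hypothetical helper for illustration only.)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Stand-in monthly series with trend, seasonality and growing variance.
y <- AirPassengers

# Assumed value of lambda; forecast::BoxCox.lambda() could estimate one instead.
lambda <- 0

# Transform, then seasonal difference (lag 12), then first difference (lag 1).
y_stationary <- diff(diff(box_cox(y, lambda), lag = 12), lag = 1)

print(length(y))             # number of original observations
print(length(y_stationary))  # 13 observations are lost to differencing

# Portmanteau test on the result, using two seasonal periods of lags.
print(Box.test(y_stationary, lag = 24, type = "Ljung-Box"))
```

Each differencing step shortens the series by its lag, which is why the result has 13 fewer observations than the input.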

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 5 lags. 
#> 
#> Value of test-statistic is: 0.0138 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

Problem 8.6

Use R to simulate and plot some data from simple ARIMA models.

a. Use the following R code to generate data from an AR(1) model with ϕ1 = 0.6 and σ^2=1. The process starts with y1=0.
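The code referred to in this part is not shown in the write-up. A minimal sketch of such a simulation loop is given here; the seed and the variable name `phi1` are assumptions added for reproducibility and readability.

```r
set.seed(123)  # assumption: seed added only so the run is reproducible

phi1 <- 0.6
y <- ts(numeric(100))  # the process starts with y[1] = 0
e <- rnorm(100)        # white noise with sigma^2 = 1

# AR(1) recursion: y[t] = phi1 * y[t-1] + e[t]
for (i in 2:100) {
  y[i] <- phi1 * y[i - 1] + e[i]
}

plot(y, main = "Simulated AR(1), phi1 = 0.6")
```

Changing `phi1` and rerunning with the same seed makes the effect of the parameter easy to compare, since the noise `e` stays identical.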

b. Produce a time plot for the series. How does the plot change as you change ϕ1?

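As ϕ1 increases towards 1, the time plot becomes smoother, since each value depends more strongly on the previous one. Lower values of ϕ1 produce a series that fluctuates more rapidly around the mean and looks more like white noise.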

c. Write your own code to generate data from an MA(1) model with θ1=0.6 and σ^2=1.
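One way to write such a generator is sketched here, mirroring the AR(1) loop. The seed is an assumption, so the simulated values will not match any series printed in this write-up; starting the series at 0 is also an assumption, chosen to agree with the first printed value.

```r
set.seed(123)  # assumption: the original seed is unknown

theta1 <- 0.6
y <- ts(numeric(100))  # y[1] is left at 0
e <- rnorm(100)        # white noise with sigma^2 = 1

# MA(1) recursion: y[t] = e[t] + theta1 * e[t-1]
for (i in 2:100) {
  y[i] <- e[i] + theta1 * e[i - 1]
}

plot(y, main = "Simulated MA(1), theta1 = 0.6")
```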

#> Time Series:
#> Start = 1 
#> End = 100 
#> Frequency = 1 
#>   [1]  0.00000000 -1.41515283 -1.96899327 -1.73833663  0.01569599  2.45315150
#>   [7]  1.31261266 -1.07876061 -1.16422045 -1.14567163 -2.73883954  0.07759259
#>  [13]  0.48533560 -0.83524203 -1.33916881 -1.35311750 -0.42593892 -0.54646349
#>  [19]  0.12036786  0.79818564  0.29384302 -0.58102202 -0.68853408  1.13807144
#>  [25]  2.05056112 -0.33730484 -1.64374935 -0.32299189  0.59200989 -0.60366657
#>  [31] -1.51432708 -1.87695408 -1.15936690 -1.24789338 -0.41160055 -1.43050020
#>  [37] -1.09214171  0.23229943  0.93185528  1.18580555  1.17167480  1.53389336
#>  [43]  0.09368677 -1.08514132 -0.08648660  0.69395294 -2.51741432 -2.93263291
#>  [49]  0.99705627  0.32000878  1.32954048  1.95010431  1.35637455  2.25613871
#>  [55]  1.69749178  1.58505568 -0.03517707 -0.50348876 -0.57714220 -0.41448283
#>  [61] -1.37883778 -1.72969399 -1.27857590 -0.32776922  0.75288361 -1.10431382
#>  [67] -0.30114856  1.31064778  0.85334872  1.62877986  0.55111638 -1.40069454
#>  [73] -0.66455969  1.75808748  1.25193567  0.57297627 -0.35177260 -0.96724601
#>  [79] -0.76625886 -0.05215635 -0.83187249  0.79007380  2.73583367  3.07264959
#>  [85] -0.04525298 -1.64342257 -0.21751862  1.58822706  0.72443658 -1.30862929
#>  [91] -0.55914025 -1.79088166 -1.17059677  1.18432663  0.09159504 -0.92106185
#>  [97] -1.56761901 -0.12406979  1.37048819  0.18021037

d. Produce a time plot for the series. How does the plot change as you change θ1?

As θ1 changes, the pattern of the time series remains consistent. The scale of the time series values increases with θ1.

e. Generate data from an ARMA(1,1) model with ϕ1=0.6, θ1=0.6 and σ^2=1.

f. Generate data from an AR(2) model with ϕ1=−0.8, ϕ2=0.3 and σ^2=1. (Note that these parameters will give a non-stationary series.)

g. Graph the latter two series and compare them.

The AR(2) series oscillates around a mean of 0 with a variance that increases over time. The ARMA(1,1) series is the better behaved of the two, although it also has high autocorrelation.

The AR(2) model behaves so poorly because its parameters violate one of the stationarity constraints, ϕ2 − ϕ1 < 1. Here ϕ2 − ϕ1 = 0.3 − (−0.8) = 1.1.
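Both series and the stationarity check can be sketched in base R. The seed is an assumption, and the same noise vector is reused for both series so that the comparison depends only on the model. The root check with `polyroot()` is an equivalent way of stating the constraints: the AR polynomial 1 − ϕ1 z − ϕ2 z^2 must have all roots outside the unit circle.

```r
set.seed(123)  # assumption: seed added for reproducibility
n <- 100
e <- rnorm(n)  # white noise with sigma^2 = 1

# ARMA(1,1): y[t] = 0.6 * y[t-1] + 0.6 * e[t-1] + e[t]
arma11 <- ts(numeric(n))
for (i in 2:n) {
  arma11[i] <- 0.6 * arma11[i - 1] + 0.6 * e[i - 1] + e[i]
}

# AR(2): y[t] = phi1 * y[t-1] + phi2 * y[t-2] + e[t]
phi1 <- -0.8
phi2 <- 0.3
ar2 <- ts(numeric(n))
for (i in 3:n) {
  ar2[i] <- phi1 * ar2[i - 1] + phi2 * ar2[i - 2] + e[i]
}

# The three AR(2) stationarity conditions.
conditions <- c(
  "abs(phi2) < 1"   = abs(phi2) < 1,
  "phi1 + phi2 < 1" = phi1 + phi2 < 1,
  "phi2 - phi1 < 1" = phi2 - phi1 < 1
)
print(conditions)

# Equivalent check: moduli of the roots of 1 - phi1*z - phi2*z^2.
roots <- polyroot(c(1, -phi1, -phi2))
print(Mod(roots))

# Plot the two series one above the other.
par(mfrow = c(2, 1))
plot(arma11, main = "ARMA(1,1)")
plot(ar2, main = "AR(2), non-stationary")
```

A root with modulus below 1 is what makes the AR(2) oscillations grow over time instead of dying out.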

Problem 8.7

a. Consider wmurders, the number of women murdered each year (per 100,000 standard population) in the United States. By studying appropriate graphs of the series in R, find an appropriate ARIMA(p, d, q) model for these data.

#> 
#> ####################### 
#> # KPSS Unit Root Test # 
#> ####################### 
#> 
#> Test is of type: mu with 3 lags. 
#> 
#> Value of test-statistic is: 0.4697 
#> 
#> Critical value for a significance level of: 
#>                 10pct  5pct 2.5pct  1pct
#> critical values 0.347 0.463  0.574 0.739

b. Should you include a constant in the model? Explain.

A constant should not be included in this model, because we differenced the series. With d = 1, a non-zero constant would add a linear drift to the forecasts. When d > 0, the Arima function sets the constant equal to 0 by default.

c. Write this model in terms of the backshift operator.

\[(1-\phi_1B)(1-B)y_t = \epsilon_t\]
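
Expanding the product makes the recursion explicit, which is what a hand calculation of the forecasts relies on. A short derivation, assuming no constant (c = 0) as argued in part (b):

\[(1-\phi_1B)(1-B)y_t = \left(1 - (1+\phi_1)B + \phi_1B^2\right)y_t\]

\[y_t = (1+\phi_1)y_{t-1} - \phi_1y_{t-2} + \epsilon_t\]

Equivalently, the change \(y_t - y_{t-1}\) equals \(\phi_1\) times the previous change plus noise.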

d. Fit the model using R and examine the residuals. Is the model satisfactory?

The residuals for an ARIMA(1, 1, 0) model are checked below. The p-value is not significant and autocorrelations for all lags are acceptable, therefore this model is satisfactory.

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(1,1,0)
#> Q* = 13.281, df = 9, p-value = 0.1503
#> 
#> Model df: 1.   Total lags used: 10

e. Forecast three times ahead. Check your forecasts by hand to make sure that you know how they have been calculated.

#> Series: wmurders 
#> ARIMA(1,1,0) 
#> 
#> Coefficients:
#>           ar1
#>       -0.0841
#> s.e.   0.1346
#> 
#> sigma^2 estimated as 0.04616:  log likelihood=6.92
#> AIC=-9.85   AICc=-9.61   BIC=-5.87

#>      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
#> 2005       2.595506 2.320177 2.870835 2.174427 3.016585
#> 2006       2.594992 2.221624 2.968359 2.023976 3.166007
#> 2007       2.595035 2.143387 3.046682 1.904300 3.285770

#> [1] "forecast 1: 2.5955066804"

#> [1] "forecast 2: 2.59499167887836"

#> [1] "forecast 3: 2.59503499050633"

#> Time Series:
#> Start = 1951 
#> End = 2004 
#> Frequency = 1 
#>  [1] -0.066051  0.010941 -0.078785  0.034196 -0.096699  0.145162 -0.055508
#>  [8]  0.093885  0.081643  0.081254  0.001387  0.048453 -0.047440  0.139087
#> [15]  0.123834  0.300088  0.288182 -0.074626  0.104202  0.184722  0.343738
#> [22] -0.065378  0.565260  0.000579 -0.094974 -0.412076  0.130620 -0.096573
#> [29]  0.182569  0.166352 -0.139356 -0.082214 -0.359198  0.040682 -0.006997
#> [36]  0.204758  0.044996  0.035790 -0.066893  0.074951  0.296692 -0.300084
#> [43]  0.247114 -0.398660 -0.076133 -0.358104 -0.275239 -0.109680 -0.105621
#> [50] -0.229222 -2.663321 -0.506770 -0.135470 -0.072844
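
The recursion behind these forecasts can be checked using only numbers printed in this part. For an ARIMA(1,1,0) model without a constant, each forecast change is ϕ1 times the previous change. The sketch below uses the rounded coefficient −0.0841 and takes −0.072844, the last value of the series printed above, to be the final observed change; that reading of the printed series as `diff(wmurders)` is an assumption, and small rounding discrepancies are expected.

```r
phi1 <- -0.0841           # rounded ar1 coefficient from the model summary
last_change <- -0.072844  # last printed value, assumed to be the final first difference

# Forecast changes for h = 1, 2, 3: each is phi1 times the previous change.
h <- 1:3
fc_changes <- last_change * phi1^h

# Printed point forecasts for 2005 to 2007.
fc_printed <- c(2.595506, 2.594992, 2.595035)

# Changes between consecutive printed forecasts should match fc_changes[2:3].
print(round(fc_changes, 6))
print(round(diff(fc_printed), 6))

# Last observation implied by the first forecast: y_T = forecast_1 - phi1 * last_change.
implied_last_obs <- fc_printed[1] - fc_changes[1]
print(round(implied_last_obs, 4))
```

The implied last observation can be compared against `tail(wmurders, 1)` to confirm the calculation.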

f. Create a plot of the series with forecasts and prediction intervals for the next three periods shown.

g. Does auto.arima() give the same model you have chosen? If not, which model do you think is better?

auto.arima() does not give the same model as my selection. My model projects a roughly constant rate of women murdered, while the auto.arima() model projects decreasing rates over the next three years.

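Let’s compare the values of AICc (the corrected Akaike Information Criterion) between the two models. The manual ARIMA model has an AICc of -9.61, while auto.arima() produced a model with an AICc of -6.7. Good models are obtained by minimizing AICc, and the manual ARIMA model has the smaller value. Note, however, that AICc values are not strictly comparable between models with different orders of differencing (d = 1 here versus d = 2), so this comparison is only indicative.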

#> Series: wmurders 
#> ARIMA(1,1,0) 
#> 
#> Coefficients:
#>           ar1
#>       -0.0841
#> s.e.   0.1346
#> 
#> sigma^2 estimated as 0.04616:  log likelihood=6.92
#> AIC=-9.85   AICc=-9.61   BIC=-5.87

#> Series: wmurders 
#> ARIMA(0,2,3) 
#> 
#> Coefficients:
#>           ma1     ma2      ma3
#>       -1.0154  0.4324  -0.3217
#> s.e.   0.1282  0.2278   0.1737
#> 
#> sigma^2 estimated as 0.04475:  log likelihood=7.77
#> AIC=-7.54   AICc=-6.7   BIC=0.35
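
The AICc values in the two model summaries can be reproduced from the printed AIC values with the usual small-sample correction. The sketch assumes that k counts the estimated coefficients plus the error variance, that n is the number of observations left after differencing, and that wmurders has 55 annual observations (1950 to 2004); `aicc` is a hypothetical helper, not a function from the forecast package.

```r
# Small-sample correction: AICc = AIC + 2k(k + 1) / (n - k - 1)
aicc <- function(aic, k, n) aic + (2 * k * (k + 1)) / (n - k - 1)

n_obs <- 55  # assumption: annual observations from 1950 to 2004

# ARIMA(1,1,0): 1 coefficient + variance, 1 observation lost to differencing.
aicc_manual <- aicc(aic = -9.85, k = 2, n = n_obs - 1)

# ARIMA(0,2,3): 3 coefficients + variance, 2 observations lost to differencing.
aicc_auto <- aicc(aic = -7.54, k = 4, n = n_obs - 2)

print(round(aicc_manual, 2))
print(round(aicc_auto, 2))
```

The two models are fitted to differently differenced data (54 versus 53 effective observations), which is one reason their information criteria are not directly comparable.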

Salma Elshahawy

2021-03-27
