HA 2.1

Use the help function to explore what the series gold, woolyrnq and gas represent.

a) Use autoplot() to plot each of these in separate plots.
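A minimal sketch of the plotting calls, assuming the fpp2 package (which supplies the three series and loads the forecast/ggplot2 plotting functions):

```r
library(fpp2)  # provides gold, woolyrnq, gas and autoplot()

autoplot(gold) + ggtitle("Daily morning gold prices (USD)")
autoplot(woolyrnq) + ggtitle("Quarterly woollen yarn production, Australia")
autoplot(gas) + ggtitle("Monthly gas production, Australia")
```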

b) What is the frequency of each series? Hint: apply the frequency() function.
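The frequencies reported below come from:

```r
frequency(gold)
frequency(woolyrnq)
frequency(gas)
```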

## [1] 1
## [1] 4
## [1] 12

The gold series is stored with frequency 1 (daily data with no seasonal period set), woolyrnq has frequency 4 (quarterly), and gas has frequency 12 (monthly).

c) Use which.max() to spot the outlier in the gold series. Which observation was it?
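which.max() returns the index of the largest observation:

```r
which.max(gold)
```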

## [1] 770

which.max() returns the position of the largest value, so the outlier is observation 770, the day on which the gold price reached its maximum in the series.

HA 2.3

Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.

You can read the data into R with the following script:
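The book's script, assuming retail.xlsx has been downloaded to the working directory:

```r
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
```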

a) Select one of the time series as follows (but replace the column name with your own chosen column):
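A sketch of the selection, following the book's template; the column name A3349873A is the book's example and should be replaced with whichever series was chosen (the printed series below is the one used for the rest of this exercise):

```r
myts <- ts(retaildata[, "A3349873A"],
           frequency = 12, start = c(1982, 4))
myts
```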

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1982                    63.9  64.0  62.7  65.6  62.6  64.4  66.0  65.3  77.9
## 1983  65.1  62.3  65.7  61.9  63.7  64.9  69.5  67.9  67.5  66.0  67.2  78.4
## 1984  69.2  68.2  70.7  67.4  71.0  67.9  67.4  70.7  68.1  73.5  75.2  81.9
## 1985  74.7  68.7  72.9  73.1  78.3  73.2  78.8  80.7  77.5  86.0  87.4  98.2
## 1986  85.4  76.5  83.7  84.9  91.9  86.7  90.5  91.3  89.6  93.0  92.3 107.4
## 1987  91.2  87.0  88.9  91.4  93.2  91.3  98.5  97.4  97.3  99.8  96.9 111.3
## 1988 100.1  95.0 101.4  96.2  97.8  91.3  94.6  95.4  98.3 101.4 100.7 120.3
## 1989 106.2  97.3 105.4 100.6 103.9 107.4 106.9 109.6 108.8 110.6 113.8 134.4
## 1990 121.4 114.6 123.5 118.8 123.6 127.5 125.9 130.1 116.9 123.3 125.3 136.5
## 1991 121.5 112.3 121.6 120.2 124.1 120.1 124.5 130.2 119.1 127.7 127.0 141.8
## 1992 133.5 125.9 135.1 135.1 137.7 129.5 134.2 134.9 132.7 138.3 134.9 153.8
## 1993 130.4 122.0 125.9 130.6 128.5 126.2 133.9 127.6 127.3 134.3 132.4 156.4
## 1994 145.7 131.7 143.9 137.1 136.1 127.8 130.5 135.7 138.5 153.9 151.6 179.2
## 1995 160.9 157.0 165.5 164.4 160.8 156.9 158.5 169.3 164.3 184.2 192.0 219.4
## 1996 168.0 166.6 170.2 163.3 167.9 174.3 169.1 173.1 164.1 178.5 172.1 188.8
## 1997 165.1 149.5 161.0 165.3 165.1 161.4 201.3 201.0 200.3 207.4 204.7 240.8
## 1998 233.7 207.8 214.0 218.9 219.3 212.4 216.7 211.9 218.6 223.4 209.4 247.8
## 1999 219.7 206.9 218.4 226.4 217.0 215.4 227.7 223.5 228.3 230.6 226.0 284.5
## 2000 185.8 183.1 186.9 176.9 172.2 176.8 169.9 171.9 169.3 153.4 150.9 174.9
## 2001 151.0 142.9 157.5 179.2 172.6 173.0 188.3 191.7 189.3 218.8 218.1 243.0
## 2002 230.5 210.4 216.0 212.6 226.1 213.6 272.8 268.3 272.3 251.0 249.8 289.3
## 2003 241.6 220.6 235.7 216.9 212.1 203.5 232.7 228.2 223.8 240.7 242.7 263.7
## 2004 231.3 215.4 228.6 234.5 221.4 218.5 224.6 228.2 226.1 236.4 248.1 296.5
## 2005 239.3 225.1 247.7 244.5 242.9 238.6 245.9 251.9 246.8 267.2 271.7 316.0
## 2006 287.0 265.7 291.5 268.2 261.7 253.6 264.8 273.6 269.1 289.6 289.6 341.9
## 2007 304.6 288.8 311.3 307.1 304.0 279.5 300.1 308.5 296.2 276.8 272.0 322.6
## 2008 239.7 229.9 241.3 223.8 234.7 221.4 238.0 250.0 243.7 250.6 254.7 320.6
## 2009 236.1 222.5 243.3 241.6 238.4 230.6 237.1 245.2 241.0 266.1 275.6 342.0
## 2010 255.4 215.5 229.6 214.4 224.5 205.6 198.7 203.3 203.2 214.4 201.4 251.8
## 2011 205.8 194.6 200.4 196.2 185.9 183.2 188.5 192.4 188.5 197.6 201.4 257.2
## 2012 188.1 200.1 199.7 214.5 216.2 208.6 204.5 213.9 211.7 230.6 219.5 276.2
## 2013 234.5 226.2 233.4 234.4 243.0 228.7 235.9 247.8 240.2 244.5 232.2 270.9

b) Explore your chosen retail time series using the following functions:

autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
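A sketch of the exploration, assuming the chosen series is stored in myts as above:

```r
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts)
ggAcf(myts)
```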

Conclusion: There is an overall positive trend in the sales data, together with clear yearly seasonality: sales peak each December. The slow decay of the autocorrelation function also reflects the positive trend.

KJ 3.1

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
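The data are loaded from the mlbench package:

```r
library(mlbench)
data(Glass)
str(Glass)
```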

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
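One possible sketch of the exploration: a numeric summary (shown below) plus faceted histograms and a correlation plot. The tidyr and corrplot packages are assumptions here; any similar tooling works:

```r
library(ggplot2)
library(tidyr)     # assumption: used only to reshape for faceting
library(corrplot)  # assumption: used for the correlation plot

summary(Glass)

# distribution of each numeric predictor
Glass[, 1:9] |>
  pivot_longer(everything(), names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")

# pairwise correlations between predictors
corrplot(cor(Glass[, 1:9]))
```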

|         | RI    | Na    | Mg    | Al    | Si    | K      | Ca     | Ba    | Fe      |
|---------|-------|-------|-------|-------|-------|--------|--------|-------|---------|
| Min.    | 1.511 | 10.73 | 0.000 | 0.290 | 69.81 | 0.0000 | 5.430  | 0.000 | 0.00000 |
| 1st Qu. | 1.517 | 12.91 | 2.115 | 1.190 | 72.28 | 0.1225 | 8.240  | 0.000 | 0.00000 |
| Median  | 1.518 | 13.30 | 3.480 | 1.360 | 72.79 | 0.5550 | 8.600  | 0.000 | 0.00000 |
| Mean    | 1.518 | 13.41 | 2.685 | 1.445 | 72.65 | 0.4971 | 8.957  | 0.175 | 0.05701 |
| 3rd Qu. | 1.519 | 13.82 | 3.600 | 1.630 | 73.09 | 0.6100 | 9.172  | 0.000 | 0.10000 |
| Max.    | 1.534 | 17.38 | 4.490 | 3.500 | 75.41 | 6.2100 | 16.190 | 3.150 | 0.51000 |

Type counts: 1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29.

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Based on the boxplots, magnesium (Mg) is the only predictor whose boxplot shows no outliers; every other predictor contains at least a few outlying points.
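A sketch of the boxplots and of the skewness calculation that produced the values below (the skewness function from the e1071 package is an assumption about which implementation was used):

```r
# boxplots to look for outliers
par(mfrow = c(3, 3))
for (nm in names(Glass)[1:9]) {
  boxplot(Glass[[nm]], main = nm)
}
par(mfrow = c(1, 1))

# skewness of each predictor, largest first
library(e1071)
sort(sapply(Glass[, 1:9], skewness), decreasing = TRUE)
```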

##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523

Potassium (K), barium (Ba), calcium (Ca), iron (Fe), and RI all show high positive skewness, while magnesium (Mg) is noticeably left-skewed.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
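One possible sketch using caret::preProcess. A Yeo-Johnson transformation is used here instead of Box-Cox because several predictors (Mg, K, Ba, Fe) contain zeros; centering, scaling, and a spatial-sign transformation can then reduce the influence of the remaining outliers. This is an illustration of one reasonable pipeline, not a definitive answer:

```r
library(caret)

trans <- preProcess(Glass[, 1:9],
                    method = c("YeoJohnson", "center", "scale", "spatialSign"))
glass_trans <- predict(trans, Glass[, 1:9])
summary(glass_trans)
```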

KJ 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:
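The data are loaded from the mlbench package:

```r
library(mlbench)
data(Soybean)
## See ?Soybean for details
```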

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Degenerate distributions are those with zero or near-zero variance (essentially a single value for almost every sample), so we check for predictors that match this definition.
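A sketch of one way to flag such predictors, using caret::nearZeroVar; the message below was produced by a check along these lines:

```r
library(caret)

degenerate <- nearZeroVar(Soybean, names = TRUE)
paste("Degenerate variables are:", paste(degenerate, collapse = ", "))
```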

## [1] "Degenerate variables are: leaf.mild, mycelium, sclerotia"

b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
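A sketch of tabulating missingness by predictor and by class; the heat-map approach is one option among several (dplyr, tidyr, and ggplot2 are assumptions):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# missing values per predictor, most-missing first
sort(colSums(is.na(Soybean)), decreasing = TRUE)

# proportion missing by class and predictor
Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ mean(is.na(.x)))) |>
  pivot_longer(-Class, names_to = "predictor", values_to = "prop_missing") |>
  ggplot(aes(x = predictor, y = Class, fill = prop_missing)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```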

Tabulating missingness by class shows a clear pattern: the missing values are concentrated in a small number of classes. The predictors germ, lodging, seed.tmt, sever, and hail have the most missing data.

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
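A sketch of one possible strategy, reusing the `degenerate` vector from part (a): drop the degenerate predictors, then impute the remaining categorical predictors. Mode imputation is used here purely for illustration; because missingness is tied to particular classes, a model-based imputation or explicit missing-value indicators would also be defensible:

```r
# drop the near-zero-variance predictors identified in part (a)
soy <- Soybean[, setdiff(names(Soybean), degenerate)]

# simple illustration: replace NAs in each factor with its most frequent level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy[-1] <- lapply(soy[-1], impute_mode)  # column 1 is the Class outcome
sum(is.na(soy))  # 0 after imputation
```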

HA 6.2

The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

a) Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
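A sketch of the plots (the plastics series ships with the fpp2 package):

```r
autoplot(plastics) + ylab("Sales (thousands)")
ggseasonplot(plastics)
```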

The plastics time series has an increasing trend with seasonal fluctuations: sales are higher in the summer and lower in the winter. Based on the seasonal plot, the peak in plastics sales occurs around August and September.

b) Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
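A sketch of the classical multiplicative decomposition:

```r
fit <- decompose(plastics, type = "multiplicative")
autoplot(fit)

fit$trend     # trend-cycle component
fit$seasonal  # seasonal indices
```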

c) Do the results support the graphical interpretation from part a?

Yes. We can see in the graphs above that there is a seasonal component to plastics sales and that the trend is increasing.

d) Compute and plot the seasonally adjusted data.
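A sketch of the seasonally adjusted series, reusing the decomposition above:

```r
autoplot(plastics, series = "Data") +
  autolayer(seasadj(fit), series = "Seasonally adjusted") +
  ylab("Sales (thousands)")
```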

e) Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
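A sketch of the recomputation; the choice of observation 30 (roughly mid-series) is arbitrary:

```r
plastics_out <- plastics
plastics_out[30] <- plastics_out[30] + 500  # inject the outlier

fit_out <- decompose(plastics_out, type = "multiplicative")
autoplot(seasadj(fit), series = "Original") +
  autolayer(seasadj(fit_out), series = "With outlier")
```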

The outlier causes the seasonally adjusted series to sit slightly higher than the original series. There are also troughs in the series that weren't present in the original series.

f) Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Where the outlier lies makes little difference to the series overall, except for where the trough appears: the trough occurs slightly later when the outlier is placed toward the end of the series.

HA 7.1

Consider the pigs series — the number of pigs slaughtered in Victoria each month.

a) Use the ses() function in R to find the optimal values of \(\alpha\) and \(l_{0}\) and generate forecasts for the next four months.
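The model summarized below was fit with:

```r
fc <- ses(pigs, h = 4)
summary(fc)
```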

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = pigs, h = 4) 
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665 
## 
## Error measures:
##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 1995       98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995       98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995       98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995       98816.41 83958.37 113674.4 76092.99 121539.8

b) Compute a 95% prediction interval for the first forecast using \(y \pm 1.96s\) where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.
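A sketch of the manual interval, using the standard deviation of the model residuals:

```r
s <- sd(residuals(fc))
c(lower = fc$mean[1] - 1.96 * s,
  upper = fc$mean[1] + 1.96 * s)
```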

Using \(\hat{y} \pm 1.96s\), the 95% prediction interval for the first forecast is (78679.97, 118952.8). This is very close to, though slightly narrower than, the interval produced by R, (78611.97, 119020.8).

HA 7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(l_{0}\). Do you get the same values as the ses() function?
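A sketch of the SSE function and the optim() call, assuming the same level-update recursion as the function from the previous exercise (not shown here); the starting values passed to optim() are arbitrary:

```r
# pars = c(alpha, level0); returns the sum of squared one-step errors
ses_sse <- function(pars, y) {
  alpha <- pars[1]
  level <- pars[2]
  sse <- 0
  for (t in seq_along(y)) {
    error <- y[t] - level
    sse <- sse + error^2
    level <- alpha * y[t] + (1 - alpha) * level
  }
  sse
}

opt <- optim(par = c(0.5, pigs[1]), fn = ses_sse, y = pigs)
opt$par  # compare with the alpha and l_0 reported by ses(pigs, h = 4)
```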

HA 8.1

Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

a) Explain the differences among these figures. Do they all indicate that the data are white noise?

Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.

As the number of observations increases, the critical-value bands in the ACF plots become narrower. All three series appear to be white noise: there is no discernible pattern in the autocorrelations, and essentially all of the spikes lie within the bands.

b) Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values lie at approximately \(\pm 1.96/\sqrt{T}\), where \(T\) is the length of the series, so they move closer to zero as the number of observations increases. The autocorrelations differ across the figures because each is computed from a different random sample; with more observations, the sample autocorrelations of white noise are estimated more precisely and therefore lie closer to zero.

HA 8.2

A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
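A sketch of the plots; ggtsdisplay() shows the time plot, ACF, and PACF in one figure:

```r
autoplot(ibmclose)
ggAcf(ibmclose)
ggPacf(ibmclose)

# or all three panels at once
ggtsdisplay(ibmclose)
```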

APPROACH

In the top plot, we can see a clear trend in IBM's stock price (it drifts downward over much of the series) rather than variation around a constant mean. The very slow decay of the ACF, together with a first PACF spike close to one, indicates that the series is non-stationary and should be differenced.

HA 8.6

a) Use R to simulate and plot some data from simple ARIMA models. Use the following R code to generate data from an AR(1) model with \(\phi_1=.6\) and \(\sigma^2=1\). The process starts with \(y_1=0\).
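The book's code for generating the AR(1) data:

```r
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
```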

b) Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?

First, we will need to adjust the code to see the effects of changing \(\phi_1\).
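A sketch of the adjustment: wrap the simulation in a function of \(\phi_1\) and overlay the series for a few arbitrary values:

```r
simulate_ar1 <- function(phi, n = 100) {
  y <- ts(numeric(n))
  e <- rnorm(n)
  for (i in 2:n)
    y[i] <- phi * y[i - 1] + e[i]
  y
}

autoplot(simulate_ar1(0.1), series = "phi = 0.1") +
  autolayer(simulate_ar1(0.6), series = "phi = 0.6") +
  autolayer(simulate_ar1(0.9), series = "phi = 0.9")
```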

We can see that as \(\phi_1\) increases, the series wanders further from zero: its variance increases and runs above or below zero last longer.

c) Write your own code to generate data from an MA(1) model with \(\theta_1=.6\) and \(\sigma^2=1\).
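A sketch of the MA(1) simulation, mirroring the AR(1) code above:

```r
y <- ts(numeric(100))
e <- rnorm(100)
for (i in 2:100)
  y[i] <- 0.6 * e[i - 1] + e[i]
autoplot(y)
```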

d) Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?

The effects of changing \(\theta_1\) are less drastic than changing \(\phi_1\). As \(\theta_1\) increases, the distance of the series from 0 also increases.

e) Generate data from an ARMA(1,1) model with \(\phi_1=.6\), \(\theta_1=.6\), and \(\sigma^2=1\).

f) Generate data from an AR(2) model with \(\phi_1=-.8\), \(\phi_2=.3\), and \(\sigma^2=1\). (Note that these parameters will give a non-stationary series.)

g) Graph the latter two series and compare them.
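A sketch covering parts (e) through (g): generate both series with the stated parameters and plot them for comparison:

```r
n <- 100
e <- rnorm(n)

# (e) ARMA(1,1) with phi1 = 0.6, theta1 = 0.6
y_arma <- ts(numeric(n))
for (i in 2:n)
  y_arma[i] <- 0.6 * y_arma[i - 1] + 0.6 * e[i - 1] + e[i]

# (f) AR(2) with phi1 = -0.8, phi2 = 0.3 (non-stationary)
y_ar2 <- ts(numeric(n))
for (i in 3:n)
  y_ar2[i] <- -0.8 * y_ar2[i - 1] + 0.3 * y_ar2[i - 2] + e[i]

# (g) compare: the ARMA(1,1) series fluctuates around zero, while the
# non-stationary AR(2) series oscillates with ever-increasing amplitude
autoplot(y_arma) + ggtitle("ARMA(1,1)")
autoplot(y_ar2) + ggtitle("AR(2), non-stationary")
```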

HA 8.8

Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.

a) Use auto.arima() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
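A sketch of the calls that produce the output and residual check below:

```r
fit <- auto.arima(austa)
summary(fit)
checkresiduals(fit)
autoplot(forecast(fit, h = 10))
```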

## Series: austa 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.3006  0.1735
## s.e.  0.1647  0.0390
## 
## sigma^2 estimated as 0.03376:  log likelihood=10.62
## AIC=-15.24   AICc=-14.46   BIC=-10.57
## 
## Training set error measures:
##                        ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0008313383 0.1759116 0.1520309 -1.069983 5.513269 0.7461559
##                      ACF1
## Training set -0.000571993

Auto ARIMA chose the model ARIMA(0,1,1) with drift.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
## 
## Model df: 2.   Total lags used: 7

The Ljung-Box test gives a p-value of 0.81, so there is no evidence of remaining autocorrelation; the residuals behave like white noise.

b) Plot forecasts from an ARIMA(0,1,1) model with no drift and compare these to part a. Remove the MA term and plot again.

c) Plot forecasts from an ARIMA(2,1,3) model with drift. Remove the constant and see what happens.

d) Plot forecasts from an ARIMA(0,0,1) model with a constant. Remove the MA term and plot again.

e) Plot forecasts from an ARIMA(0,2,1) model with no constant.
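A sketch of the fits for parts (b) through (e), using Arima() from the forecast package; each call is wrapped in a forecast plot:

```r
# (b) ARIMA(0,1,1) with no drift, then with the MA term removed
autoplot(forecast(Arima(austa, order = c(0, 1, 1)), h = 10))
autoplot(forecast(Arima(austa, order = c(0, 1, 0)), h = 10))

# (c) ARIMA(2,1,3) with drift; removing the constant may cause the fit to
# fail or change markedly, which is part of the point of the exercise
autoplot(forecast(Arima(austa, order = c(2, 1, 3), include.drift = TRUE), h = 10))

# (d) ARIMA(0,0,1) with a constant, then with the MA term removed
autoplot(forecast(Arima(austa, order = c(0, 0, 1), include.constant = TRUE), h = 10))
autoplot(forecast(Arima(austa, order = c(0, 0, 0), include.constant = TRUE), h = 10))

# (e) ARIMA(0,2,1) with no constant
autoplot(forecast(Arima(austa, order = c(0, 2, 1)), h = 10))
```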