HA 2.1
Use the help function to explore what the series gold, woolyrnq and gas represent.
gold: Daily morning gold prices in US dollars. 1 January 1985 – 31 March 1989.
woolyrnq: Quarterly production of woollen yarn in Australia: tonnes. Mar. 1965 – Sep. 1994.
gas: Australian monthly gas production: 1956–1995.
### a
Use autoplot() to plot each of these in separate plots.
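A minimal sketch of the plotting calls, assuming the fpp2 package (which supplies these series and autoplot()) is loaded:
library(fpp2)

autoplot(gold)     # daily gold prices
autoplot(woolyrnq) # quarterly woollen yarn production
autoplot(gas)      # monthly gas production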
### b
What is the frequency of each series? Hint: apply the frequency() function.
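The output below was presumably produced by calls along these lines; the print() wrappers are my reconstruction:
print(paste0("Frequency of gold series: ", frequency(gold), ". Suggest: 365.25 for daily"))
print(paste0("Frequency of woolyrnq series: ", frequency(woolyrnq), ". Suggest: 4 for quarterly"))
print(paste0("Frequency of gas series: ", frequency(gas), ". Suggest: 12 for monthly"))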
## [1] "Frequency of gold series: 1. Suggest: 365.25 for daily"
## [1] "Frequency of woolyrnq series: 4. Suggest: 4 for quarterly"
## [1] "Frequency of gas series: 12. Suggest: 12 for monthly"
### c
Use which.max() to spot the outlier in the gold series. Which observation was it?
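A sketch of the lookup, assuming the same fpp2 session:
print(paste("The outlier in gold series appears in observation", which.max(gold)))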
## [1] "The outlier in gold series appears in observation 770"
HA 2.3
Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS Excel file.
### a
You can read the data into R with the following script:
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
The second argument (skip=1) is required because the Excel sheet has two header rows.
## # A tibble: 6 × 190
## `Series ID` A3349335T A3349627V A3349338X A3349398A A3349468W
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-04-01 00:00:00 303. 41.7 63.9 409. 65.8
## 2 1982-05-01 00:00:00 298. 43.1 64 405. 65.8
## 3 1982-06-01 00:00:00 298 40.3 62.7 401 62.3
## 4 1982-07-01 00:00:00 308. 40.9 65.6 414. 68.2
## 5 1982-08-01 00:00:00 299. 42.1 62.6 404. 66
## 6 1982-09-01 00:00:00 305. 42 64.4 412. 62.3
## # … with 184 more variables: A3349336V <dbl>, A3349337W <dbl>, A3349397X <dbl>,
## # A3349399C <dbl>, A3349874C <dbl>, A3349871W <dbl>, A3349790V <dbl>,
## # A3349556W <dbl>, A3349791W <dbl>, A3349401C <dbl>, A3349873A <dbl>,
## # A3349872X <dbl>, A3349709X <dbl>, A3349792X <dbl>, A3349789K <dbl>,
## # A3349555V <dbl>, A3349565X <dbl>, A3349414R <dbl>, A3349799R <dbl>,
## # A3349642T <dbl>, A3349413L <dbl>, A3349564W <dbl>, A3349416V <dbl>,
## # A3349643V <dbl>, A3349483V <dbl>, A3349722T <dbl>, A3349727C <dbl>, …
### b
Select one of the time series as follows (but replace the column name with your own chosen column):
myts <- ts(retaildata[,"A3349873A"],
           frequency=12, start=c(1982,4))
## Apr May Jun Jul Aug Sep
## 1982 91.8 102.6 105.0 106.0 96.9 97.5
### c
Explore your chosen retail time series using the following functions: autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf(). Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
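The plots discussed below come from calls like the following, using the myts series defined in part b:
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts)
ggAcf(myts)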
autoplot: there is a clear increasing trend.
seasonalplot: retail sales fall steadily from January into early February, possibly because people tighten their budgets after the November and December spending of the previous year. Sales then rise slowly to a peak in June, which could reflect mid-year stocking up for outdoor activities; the gradual decline afterwards needs more investigation. From November to December, in each successive year people spend more than the year before, likely the effect of holidays, discounts and annual sales.
subseriesplot: the monthly averages are fairly stable from January to November, but rise in the last month of the year.
lagplot: the relationships in the lag plots are all positive, and lag 12 shows a much stronger relationship than the others.
autocorrelation: the correlations are all significant; the slight decrease as the lag increases is due to the trend, and the scalloped shape is due to the seasonality.
Conclusion:
- there is a seasonal pattern
- there is no clear cyclic pattern
- there is an increasing trend
HA 6.2
The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
### a
Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
The seasonal fluctuation increases from roughly the second month of the year to the eighth month, then decreases until about the end of the first month of the next year. There is also an increasing trend.
### b
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
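A sketch of the decomposition, assuming fpp2 (which provides the plastics series) is loaded:
plastics %>%
  decompose(type = "multiplicative") %>%
  autoplot()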
### c
Do the results support the graphical interpretation from part a?
Yes: the trend-cycle component rises steadily and the seasonal indices follow the within-year pattern described in part a.
### d
Compute and plot the seasonally adjusted data.
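A sketch using forecast's seasadj() helper; the object name fit is my own:
fit <- decompose(plastics, type = "multiplicative")
autoplot(seasadj(fit)) # seasonally adjusted = data / seasonal component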
### e
Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
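A sketch of the experiment; the choice of observation 30 is arbitrary:
plastics_out <- plastics
plastics_out[30] <- plastics_out[30] + 500 # inject the outlier mid-series
fit_out <- decompose(plastics_out, type = "multiplicative")
autoplot(seasadj(fit_out))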
The outlier shows up in the seasonally adjusted data. It slightly affects the points around it, but it does not seem to affect the points far away from it.
### f
Does it make any difference if the outlier is near the end rather than in the middle of the time series?
An outlier near the end only affects the data close to it, whereas an outlier in the middle seems to affect all of the data, whether close to it or further away.
KJ 3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
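The code chunk itself did not survive the export; the standard access route, as in the KJ text, is through the mlbench package:
library(mlbench)
data(Glass)
str(Glass)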
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
### a
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
From the distribution plots, Ba, Fe and K are skewed; Al, Ca, Na, RI and Si are roughly normal; Mg seems bimodal.
From the correlation plot, if we set the threshold at 0.8, only Ca and RI are strongly correlated. The remaining predictor variables do not seem to have strong relationships.
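The printed suggestion below is consistent with caret's findCorrelation() applied to the predictor correlation matrix; a sketch, with the 0.8 cutoff matching the threshold above:
library(caret)
high_cor <- findCorrelation(cor(Glass[, 1:9]), cutoff = 0.8)
print(paste("Suggested high correlated column for deletion is", high_cor))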
## [1] "Suggested high correlated column for deletion is 7"
### b
Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, the boxplots help to show the outliers. Predictor variables like Ba, Fe, K and Mg are skewed.
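A minimal sketch of the boxplots, assuming Glass is loaded as above:
par(mfrow = c(3, 3))
for (nm in names(Glass)[1:9]) {
  boxplot(Glass[[nm]], main = nm, horizontal = TRUE)
}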
### c
Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes. A log transformation suits the right-skewed variables, and a Box-Cox transformation can be applied across the variables to find an appropriate transformation for each.
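A sketch with caret's preProcess(). Note that Box-Cox requires strictly positive values, so for the zero-heavy predictors (Ba, Fe, K) the related Yeo-Johnson transformation is one option:
library(caret)
trans <- preProcess(Glass[, 1:9], method = "YeoJohnson") # handles zeros, unlike "BoxCox"
Glass_trans <- predict(trans, Glass[, 1:9])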
KJ 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
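As with the glass data, the chunk itself is missing; the standard route is via mlbench:
library(mlbench)
data(Soybean)
str(Soybean)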
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
### a
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
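The column numbers printed below look like the result of caret's nearZeroVar() on the predictors alone (with the Class column dropped, so indices are offset by one from the str() listing); a sketch:
library(caret)
nzv <- nearZeroVar(Soybean[, -1]) # near-zero-variance (degenerate) predictors
print(paste("the distributions degenerate in the way discussed in the chapter correspond to variables with column", nzv))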
## [1] "the distributions degenerate in the way discussed in the chapter correspond to variables with column 18"
## [2] "the distributions degenerate in the way discussed in the chapter correspond to variables with column 25"
## [3] "the distributions degenerate in the way discussed in the chapter correspond to variables with column 27"
### b
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
As the table below shows, the variables hail, lodging, seed.tmt and sever are more likely to be missing.
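A sketch that reproduces a table like the one below (the ordering among tied counts may differ):
library(dplyr)
library(tidyr)
Soybean %>%
  summarise(across(-Class, ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "var", values_to = "n_miss") %>%
  arrange(desc(n_miss))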
## # A tibble: 35 × 2
## var n_miss
## <chr> <int>
## 1 hail 121
## 2 lodging 121
## 3 seed.tmt 121
## 4 sever 121
## 5 germ 112
## 6 leaf.mild 108
## 7 fruit.spots 106
## 8 fruiting.bodies 106
## 9 seed.discolor 106
## 10 shriveling 106
## # … with 25 more rows
From the summary table, there are 5 main classes associated with the missing values.
## # A tibble: 5 × 1
## Class
## <fct>
## 1 2-4-d-injury
## 2 cyst-nematode
## 3 diaporthe-pod-&-stem-blight
## 4 herbicide-injury
## 5 phytophthora-rot
### c
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given that the data set contains only 683 observations, rows will be dropped only where a variable has a single missing value; for the rest of the data, I think it is better to impute, to preserve the integrity of the sample.
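A minimal sketch of one such strategy, mode imputation for the categorical predictors; the helper name impute_mode is my own, and model-based imputation (e.g., via the mice package) would preserve relationships between predictors better:
impute_mode <- function(x){
  x[is.na(x)] <- names(which.max(table(x))) # most frequent level
  x
}
Soybean_imp <- Soybean
Soybean_imp[, -1] <- lapply(Soybean_imp[, -1], impute_mode)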
HA 7.1
Consider the pigs series – the number of pigs slaughtered in Victoria each month.
### a
Use the ses() function in R to find the optimal values of \(\alpha\) and \(l_0\), and generate forecasts for the next four months.
pig_mod <- pigs %>% ses(h = 4)
pig_mod$model
## Simple exponential smoothing
##
## Call:
## ses(y = ., h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
### b
Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.
The interval calculated by hand is slightly narrower than the one produced by R.
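A sketch of the comparison, using the pig_mod object fitted above (the formatted print statements behind the output below are omitted):
s   <- sd(pig_mod$residuals)
fc1 <- pig_mod$mean[1]                               # first ses() forecast
c(fc1 - 1.96 * s, fc1 + 1.96 * s)                    # interval by hand
c(pig_mod$lower[1, "95%"], pig_mod$upper[1, "95%"])  # interval from R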
## [1] "the interval calculated by hand: (78679.967, 118952.845)"
## [1] "the interval produced by R: (78611.968, 119020.844)"
HA 7.2
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter \(\alpha\)) and level (the initial level \(l_0\)). It should return the forecast of the next observation in the series. Does it give the same forecasts as ses()?
Comparing the result from the function I created with the result the model produces: if I use the exact same parameters and time series, the results are the same.
ses_cast <- function(y, a, l){
  n <- length(y)
  y_hat <- 0
  # weighted-average form of SES: weight a*(1-a)^(j-1) on the j-th most recent observation
  for(a_pow in seq_len(n)){
    y_hat <- y_hat + a * (1-a)^(a_pow-1) * y[n - a_pow + 1]
  }
  # the remaining weight (1-a)^n falls on the initial level l0
  return(y_hat + (1-a)^n * l)
}
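A call along these lines, pulling the fitted parameters out of pig_mod from HA 7.1, should reproduce the comparison below:
alpha <- pig_mod$model$par["alpha"]
l0    <- pig_mod$model$par["l"]
print(paste("value calculated by created function is", ses_cast(pigs, a = alpha, l = l0)))
print(paste("value from the model is", pig_mod$mean[1]))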
## [1] "value calculated by created function is 98816.4061115907"
## [1] "value from the model is 98816.4061115907"
HA 7.3
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(l_0\). Do you get the same value as the ses() function?
The optimised \(\alpha\) is approximately 0.299 and the initial state \(l_0\) is approximately 76379.27. The ses() function gives \(\alpha = 0.2971\) and initial state \(l = 77260.0561\). They are not exactly the same, but pretty close.
sse_cast <- function(pars = c(a, l), y){
  alpha <- pars[1]
  l0    <- pars[2]
  n     <- length(y)
  df <- data.frame(time        = integer(n),
                   observation = numeric(n),
                   level       = numeric(n),
                   forecasts   = numeric(n))
  for(i in seq_len(n)){
    df[i, "time"]        <- i
    df[i, "observation"] <- y[i]
    df[i, "forecasts"]   <- l0                             # one-step forecast is the current level
    df[i, "level"]       <- alpha * y[i] + (1-alpha) * l0  # update the level with the new observation
    l0 <- df[i, "level"]                                   # carry the updated level forward
  }
  return(sum((df$observation - df$forecasts)^2))
}
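The optim() output below presumably came from a call like this; the starting values are my assumption:
optim(par = c(0.5, pigs[1]), fn = sse_cast, y = pigs)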
## $par
## [1] 2.990081e-01 7.637927e+04
##
## $value
## [1] 19767049111
##
## $counts
## function gradient
## 91 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
HA 8.1
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.
### a
Explain the differences among these figures. Do they all indicate that the data are white noise?
The autocorrelations in x1 are larger than in x2, and those in x2 are larger than in x3. Likewise, the critical-value bounds in x1 are wider than in x2, and those in x2 are wider than in x3. All three figures indicate white noise, because most of the spikes lie between the blue dashed lines.
### b
Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
The critical values sit at different distances from zero because, for white noise, the sample autocorrelations are expected to lie within \(\pm 2/\sqrt{T}\), where \(T\) is the length of the series; since x1, x2 and x3 have different lengths, the bounds differ. The autocorrelations differ across the figures because the sample sizes differ: white noise autocorrelations are random, and the larger the sample, the smaller the autocorrelations and the narrower the bounds, and vice versa.
HA 8.2
a classic example of non-stationary series is the daily
closing IBM stock price series (data set ibmclose). use R
to plot the daily closing prices for IBM stock and the ACF and PACF.
Explain how each plot shows that the series is non-stationary and should
be differenced.
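A compact way to get all three plots, assuming fpp2 (which provides ibmclose) is loaded:
library(fpp2)
ggtsdisplay(ibmclose) # time plot, ACF and PACF in one figure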
From the time plot, the IBM daily closing price appears to have a trend.
In the ACF, the autocorrelations are well above the critical bounds and decay slowly, which means the observations are not independent.
In the PACF, only the first lag is significantly different from zero, which means each point is correlated with the point right before it.
In conclusion, the series is non-stationary and should be differenced.
HA 8.6
Use R to simulate and plot some data from simple ARIMA models.
### a
Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_1 = 0\).
set.seed(100)
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1]+e[i]
### b
Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
The greater \(\phi_1\), the smoother the series is.
### c
Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
theta <- function(t){
  y <- ts(numeric(100)) # y[1] starts at 0
  e <- rnorm(100)       # sigma^2 = 1
  for(i in 2:100){
    y[i] <- e[i] + t*e[i-1]
  }
  return(y)
}
### d
Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
The greater the value of \(\theta_1\), the smoother the series is.
### e
Generate data from an ARMA(1,1) model with \(\phi_1 = -0.8\), \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
gen_arima <- function(phi, theta){
  y <- ts(numeric(100)) # y[1] starts at 0
  e <- rnorm(100)       # sigma^2 = 1
  for(i in 2:100)
    y[i] <- phi*y[i-1] + theta*e[i-1] + e[i]
  return(y)
}
### f
Generate data from an AR(2) model with \(\phi_1 = -0.8\), \(\phi_2 = 0.3\) and \(\sigma^2 = 1\). (Note that these parameters will give a non-stationary series.)
gen_ar <- function(phi1, phi2){
  y <- ts(numeric(100)) # y[1] and y[2] start at 0
  e <- rnorm(100)       # sigma^2 = 1
  for(i in 3:100)
    y[i] <- phi1*y[i-1] + phi2*y[i-2] + e[i]
  return(y)
}
### g
Graph the latter two series and compare them.
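A sketch of the comparison, reusing gen_arima() and gen_ar() from parts e and f; gridExtra is one way to stack the two plots:
library(gridExtra)
grid.arrange(
  autoplot(gen_arima(-0.8, 0.6)) + ggtitle("ARMA(1,1): phi = -0.8, theta = 0.6"),
  autoplot(gen_ar(-0.8, 0.3))    + ggtitle("AR(2): phi1 = -0.8, phi2 = 0.3"),
  nrow = 2
)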
Clearly from the graphs, the variance of the AR(2) series increases exponentially, oscillating with ever-growing amplitude, while the variance of the ARMA(1,1) series looks fairly constant; it almost resembles white noise.
HA 8.8
Consider austa, the total international visitors to Australia (in millions) for the period 1980–2015.
### a
Use auto.arima() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
The auto.arima() function selects an MA(1) model with one order of differencing, ARIMA(0,1,1) with drift. In the residual plot, all the lags lie between the blue dashed lines, which indicates that the residuals are white noise. Since the frequency is 1, this is a yearly series, so the 10 forecast periods are the next 10 years; the fitted model and the Ljung-Box test are shown below.
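The output below matches calls along these lines; the object name fit is my own:
fit <- auto.arima(austa)
fit                 # model summary shown below
checkresiduals(fit) # Ljung-Box output shown below
fit %>% forecast(h = 10) %>% autoplot()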
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 = 0.03376: log likelihood = 10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
##
## Model df: 2. Total lags used: 7
### b
Plot forecasts from an ARIMA(0,1,1) model with no drift and compare these to part a. Remove the MA term and plot again.
Comparing both plots (paying attention to the scale), the ARIMA(0,1,0) model has a narrower prediction interval.
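A sketch of the two fits being compared:
austa %>% Arima(order = c(0,1,1), include.drift = FALSE) %>%
  forecast(h = 10) %>% autoplot()
austa %>% Arima(order = c(0,1,0), include.drift = FALSE) %>%
  forecast(h = 10) %>% autoplot()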
### c
Plot forecasts from an ARIMA(2,1,3) model with drift. Remove the constant and see what happens.
There is no visible difference between the ARIMA(2,1,3) forecasts with and without the constant.
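A sketch of the two fits:
austa %>% Arima(order = c(2,1,3), include.drift = TRUE) %>%
  forecast(h = 10) %>% autoplot()
austa %>% Arima(order = c(2,1,3), include.constant = FALSE) %>%
  forecast(h = 10) %>% autoplot()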
### d
Plot forecasts from an ARIMA(0,0,1) model with a constant. Remove the MA term and plot again.
The ARIMA(0,0,1) forecasts decrease and converge to the mean of the past values after the first step, while the ARIMA(0,0,0) model puts equal weight on every past observation, so all of its forecasts are simply the mean.
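A sketch of the two fits:
austa %>% Arima(order = c(0,0,1), include.constant = TRUE) %>%
  forecast(h = 10) %>% autoplot()
austa %>% Arima(order = c(0,0,0), include.constant = TRUE) %>%
  forecast(h = 10) %>% autoplot()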
### e
Plot forecasts from an ARIMA(0,2,1) model with no constant.
The prediction interval widens as the forecast horizon increases.
Arima(austa, order = c(0,2,1), method = "ML", include.constant = FALSE) %>%
  forecast(h = 10) %>%
  autoplot()