## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.4
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3     ✔ fable       0.3.3
## ✔ ggplot2     3.5.1     ✔ fabletools  0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──
## ✔ forecast  8.20     ✔ expsmooth 2.3 
## ✔ fma       2.5
## 
## 
## Attaching package: 'fpp2'
## The following object is masked from 'package:fpp3':
## 
##     insurance
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:fma':
## 
##     pigs
## corrplot 0.92 loaded

Introduction

This assignment is to be submitted via email by the designated group representative and follows the Hyndman version of the problems assigned in batch #1. The first batch consists of 12 problems, drawn from both Kuhn & Johnson (KJ) and Hyndman & Athanasopoulos (HA) sources; each problem is addressed in its own section below.

The tasks outlined in these problems require a working understanding of the material covered in the corresponding chapters.

Exercise 2.3

  Download some monthly Australian retail data from OTexts.org/fpp2/extrafiles/retail.xlsx. These represent retail sales in various categories for different Australian states, and are stored in an MS Excel file.

  1. Read the data into R.

  2. Select one of the time series as follows (but replace the column name with your own chosen column):

  3. Explore your chosen retail time series using the following functions:

autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()
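As a sketch, the series can be read and explored as follows; the column name A3349873A is just an example, and any series column can be substituted:

library(fpp2)
library(readxl)

retaildata <- read_excel("retail.xlsx", skip = 1)   # skip = 1 because the sheet has an extra header row
myts <- ts(retaildata[["A3349873A"]],               # example column; choose your own
           frequency = 12, start = c(1982, 4))

autoplot(myts)          # time plot: trend + seasonality
ggseasonplot(myts)      # seasonal plot by month
ggsubseriesplot(myts)   # subseries plot, one panel per month
gglagplot(myts)         # lag plots
ggAcf(myts)             # autocorrelation function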

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

The autoplot shows strong seasonality in the data, as well as an upward trend. There is a dip from roughly 1990 to 2000, but no evidence yet that it is part of a regular cycle.

The seasonal plot emphasizes the seasonality of the data. Sales begin to rise in the final months of the year, spiking sharply between November and December, then fall off after January, coinciding with Christmas holiday shopping and sales.

The subseries plot also highlights the seasonality, and does so more clearly than the seasonal plot. Although sales rise from September onward, the floor for those months remains about the same; December is the real exception, with both a higher ceiling and a higher floor.

The lag plots are not very readable for this series. Some positive and some negative relationships are visible, but the number of panels, and the fact that the data are monthly, make it difficult to discern much.

In the ACF, the slow decrease in the autocorrelations as the lag increases reflects the trend, while the scalloped shape reflects the seasonality of the sales data.


Exercise 7.1

  Consider the pigs series, the number of pigs slaughtered in Victoria each month.

  1. Use the ses() function in R to find the optimal values of alpha and \(l_0\), and generate forecasts for the next four months.
##      time year quarter gilts profit s_per_herdsz production herdsz
## 1 1967.00 1967       1   105  8.075        10.80       2645    703
## 2 1967.25 1967       2   119  7.819         9.16       2540    722
## 3 1967.50 1967       3   119  7.366         9.38       2565    738
## 4 1967.75 1967       4   109  8.113        10.39       2776    747
## 5 1968.00 1968       1   117  7.380         9.44       2725    755
## 6 1968.25 1968       2   135  7.134         8.69       2623    780
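(Note that the head() output above belongs to a different pigs data frame; see the discussion in Exercise 7.2.) The ETS report and forecasts below come from the fpp3 workflow; a minimal sketch, assuming the aus_livestock data from tsibbledata:

library(fpp3)

fit <- aus_livestock |>
  filter(Animal == "Pigs", State == "Victoria") |>
  model(ses = ETS(Count ~ error("A") + trend("N") + season("N")))

report(fit)                   # estimated alpha and l[0]
fc <- fit |> forecast(h = 4)  # forecasts for the next four months
fc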

## Series: Count 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.3221247 
## 
##   Initial states:
##      l[0]
##  100646.6
## 
##   sigma^2:  87480760
## 
##      AIC     AICc      BIC 
## 13737.10 13737.14 13750.07
## # A fable: 4 x 6 [1M]
## # Key:     Animal, State, .model [1]
##   Animal State    .model    Month             Count  .mean
##   <fct>  <fct>    <chr>     <mth>            <dist>  <dbl>
## 1 Pigs   Victoria ses    2019 Jan N(95187, 8.7e+07) 95187.
## 2 Pigs   Victoria ses    2019 Feb N(95187, 9.7e+07) 95187.
## 3 Pigs   Victoria ses    2019 Mar N(95187, 1.1e+08) 95187.
## 4 Pigs   Victoria ses    2019 Apr N(95187, 1.1e+08) 95187.

  2. Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.
## <distribution[2]>
##              Lower              Upper 
##  N(76871, 8.7e+07) N(113502, 8.7e+07)

Using \(\hat{y} \pm 1.96s\), the 95% prediction interval for the first forecast runs from 76,871 to 113,502. R's interval, shown below, is [76,854.79, 113,518.3]: very close, but slightly wider, because R's estimate of the forecast standard deviation is not quite the same as the raw residual standard deviation.

## <hilo[4]>
## [1] [76854.79, 113518.3]95 [75927.17, 114445.9]95 [75042.22, 115330.9]95
## [4] [74194.54, 116178.6]95
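A sketch of how these numbers can be reproduced, continuing from the fit above (hilo() comes from fabletools):

s <- augment(fit) |> pull(.resid) |> sd()    # residual standard deviation
yhat <- fc |> pull(.mean) |> head(1)         # first point forecast
c(Lower = yhat - 1.96 * s, Upper = yhat + 1.96 * s)

hilo(fc$Count, 95)                           # R's 95% intervals for all four forecasts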

Exercise 7.2

  1. Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter alpha) and level (the initial level \(l_0\)). It should return the forecast of the next observation in the series. Does it give the same forecast as ses?
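A minimal sketch of such a function, implementing the SES recursion \(l_t = \alpha y_t + (1 - \alpha) l_{t-1}\); the forecast of the next observation is simply the final level:

my_ses <- function(y, alpha, level) {
  l <- level
  for (t in seq_along(y)) {
    l <- alpha * y[t] + (1 - alpha) * l   # update the level
  }
  l   # one-step-ahead forecast
}

Called with the alpha and \(l_0\) estimated by ses() on the same series, this recursion should reproduce ses()'s one-step forecast.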
## [1] 2724.992

The forecast from the custom SES function (2724.992) does not match the 95,187 produced by R for January 2019. The scale of the mismatch suggests the custom function was run on a different series: 2724.992 is on the scale of the production column of the head() output shown in Exercise 7.1, and the startup messages note that GGally's pigs masks fma's pigs. Applied to the intended pigs series with the same alpha and \(l_0\), the recursion above reproduces the ses() forecast.

Exercise 7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of alpha and \(l_0\). Do you get the same values as the ses() function?
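A sketch of the modified function and the optim() call (written against fma::pigs explicitly, to avoid the masking issue noted in Exercise 7.2):

ses_sse <- function(pars, y) {
  alpha <- pars[1]
  l <- pars[2]
  sse <- 0
  for (t in seq_along(y)) {
    e <- y[t] - l                         # one-step forecast error
    sse <- sse + e^2
    l <- alpha * y[t] + (1 - alpha) * l   # SES level update
  }
  sse
}

opt <- optim(par = c(0.5, fma::pigs[1]), fn = ses_sse, y = fma::pigs)
opt$par   # optimal alpha and l0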

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = p, initial = "simple", alpha = optimal_alpha) 
## 
##   Smoothing parameters:
##     alpha = 0.8338 
## 
##   Initial states:
##     l = 94200 
## 
##   sigma:  10462.3
## Error measures:
##                     ME    RMSE      MAE        MPE     MAPE      MASE
## Training set -1.885116 10462.3 7846.013 -0.8944317 9.093438 0.8469934
##                    ACF1
## Training set -0.3691641
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2019       93322.92 79914.94 106730.9 72817.19 113828.7
## Feb 2019       93322.92 75865.56 110780.3 66624.19 120021.7
## Mar 2019       93322.92 72592.62 114053.2 61618.66 125027.2
## Apr 2019       93322.92 69770.19 116875.6 57302.13 129343.7
## May 2019       93322.92 67251.54 119394.3 53450.19 133195.7
## Jun 2019       93322.92 64955.64 121690.2 49938.91 136706.9
## Jul 2019       93322.92 62832.13 123813.7 46691.28 139954.6
## Aug 2019       93322.92 60847.18 125798.7 43655.56 142990.3
## Sep 2019       93322.92 58976.75 127669.1 40794.98 145850.9
## Oct 2019       93322.92 57203.04 129442.8 38082.34 148563.5
## Optimal alpha (optim()): 0.8338
## Optimal l0 (optim()): 89414.95

These values minimize the sum of squared errors (SSE) over the training data. The alpha estimates agree (0.8338), although here that match is by construction, since optim()'s alpha was passed into the ses() call. The initial levels differ: optim() returns an \(l_0\) of 89,414.95, whereas ses() with initial = "simple" fixes \(l_0\) at the first observation (94,200) rather than optimising it, which accounts for the discrepancy.


Exercise 8.8

Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.

Step 1: Use auto.arima() to Find an Appropriate ARIMA Model
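The model below can be produced with auto.arima() from the forecast package; a minimal sketch:

fit <- auto.arima(austa)   # selects the best ARIMA model by AICc
summary(fit)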

## Series: austa 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.3006  0.1735
## s.e.  0.1647  0.0390
## 
## sigma^2 = 0.03376:  log likelihood = 10.62
## AIC=-15.24   AICc=-14.46   BIC=-10.57

Step 2: Check That the Residuals Look Like White Noise
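The Ljung-Box output below comes from a residual check such as:

checkresiduals(fit)   # residual plots plus the Ljung-Box test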

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
## 
## Model df: 1.   Total lags used: 7

Step 3: Plot Forecasts from the Model for the Next 10 Periods
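A sketch of the forecast plot:

autoplot(forecast(fit, h = 10))   # point forecasts with 80% and 95% intervals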

The ARIMA(0,1,1) model with drift appears to provide a reasonable fit to the austa series: the Ljung-Box test (p = 0.8905) shows no significant autocorrelation in the residuals, and auto.arima() selected the model as minimising the AICc among the candidates it considered.

Exercise 3.2

  1. Why is a Box-Cox transformation unhelpful for the cangas data?

A Box-Cox transformation is only helpful when the variation in a series increases (or decreases) with its level. For the cangas data the seasonal variation is small at the start of the series, largest in the middle, and smaller again toward the end, so no single value of \(\lambda\) can stabilise it. Other techniques, such as seasonal decomposition or differencing, would be needed for this series rather than a power transformation.
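To see this, one can compare the raw and the transformed series; a sketch, with BoxCox.lambda() choosing \(\lambda\) automatically:

lambda <- BoxCox.lambda(cangas)
autoplot(cangas)                   # variation is largest in the middle of the series
autoplot(BoxCox(cangas, lambda))   # the transformed series shows much the same pattern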

Exercise 6.2

  The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years. (A code sketch for parts b through f follows the list.)

  1. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle? Yes: there are clear seasonal fluctuations that repeat each year, together with an upward trend-cycle.

  2. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

  3. Do the results support the graphical interpretation from part a? Yes. The decomposed plot confirms the seasonal fluctuations and the trend-cycle.

  4. Compute and plot the seasonally adjusted data.

  5. Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier? The outlier shows up as a spike in the seasonally adjusted series, since a one-off shock cannot be absorbed by the seasonal indices.

  6. Does it make any difference if the outlier is near the end rather than in the middle of the time series? Yes. An outlier in the middle mainly distorts the trend-cycle estimate locally, while one near the end distorts the most recent seasonally adjusted values, which matter most if recent data are weighted heavily for forecasting; predictability goes down when there is an outlier in the most recent data.
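A sketch covering parts b through f (the plastics series ships with fpp2):

dec <- decompose(plastics, type = "multiplicative")
autoplot(dec)             # trend-cycle and seasonal indices (parts b and c)
autoplot(seasadj(dec))    # seasonally adjusted data (part d)

plastics2 <- plastics
plastics2[30] <- plastics2[30] + 500   # outlier in the middle (part e)
autoplot(seasadj(decompose(plastics2, type = "multiplicative")))

plastics3 <- plastics
plastics3[58] <- plastics3[58] + 500   # outlier near the end (part f)
autoplot(seasadj(decompose(plastics3, type = "multiplicative")))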

Exercise 8.2

  A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced. (A plotting sketch follows.)

  • The time plot shows clear trends throughout, with no constant mean level, indicating non-stationarity.

  • The ACF decays slowly and gradually rather than dropping quickly to zero, which also suggests non-stationarity.

  • The PACF shows a single large spike at lag 1 and is near zero thereafter, consistent with a random walk that should be differenced.
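A sketch of the plots (ibmclose ships with the fma package):

ggtsdisplay(ibmclose)   # time plot, ACF and PACF in a single figure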

Exercise 2.1

Question:

Use the help function to explore what the series gold, woolyrnq and gas represent.

  1. Use autoplot() to plot each of these in separate plots.
  2. What is the frequency of each series? Hint: apply the frequency() function.
  3. Use which.max() to spot the outlier in the gold series. Which observation was it?

Answer:

Help function

Let’s use the help function to explore the data series ‘gold’, ‘woolyrnq’ and ‘gas’
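For example:

help(gold)
help(woolyrnq)
help(gas)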

This data series represents the daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989.

This data series tracks the quarterly production of woollen yarn in Australia, in tonnes, from Mar 1965 to Sep 1994.

This data series tracks Australian monthly gas production from 1956 to 1995.

Data visualization

Here, we will use ‘autoplot’ to plot the time series separately
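For example:

autoplot(gold)
autoplot(woolyrnq)
autoplot(gas)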

Observation:

  • Gold Price: The price of gold generally rises day over day until approximately day 770, where it experiences a significant spike. After this peak, the price trends downward until about day 1000.

  • Woolen Yarn Production: The production of woolen yarn exhibits a seasonal pattern, characterized by regular fluctuations with many ups and downs.

  • Gas Production: Gas production shows a seasonal pattern and a noticeable upward trend, particularly after 1970.

Frequency of each series
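The output below comes from calls such as:

frequency(gold)
frequency(woolyrnq)
frequency(gas)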

## [1] 1
## [1] 4
## [1] 12

To summarize, here is the frequency of the series

  • gold: daily data, but the series is stored with frequency 1 (no seasonal period is encoded), which is why frequency() returns 1 rather than approximately 365.

  • woolyrnq: Quarterly data, so the frequency should be 4.

  • gas: Monthly data, so the frequency should be 12.

Outlier detection for the Gold data
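The output below comes from:

which.max(gold)         # index of the largest observation
gold[which.max(gold)]   # the price on that day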

## [1] 770
## [1] 593.7

The outlier in the gold series was the spike that occurred on day 770, with the gold price reaching 593.7 US dollars.

Exercise 3.1

Question:

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer:

Load the data set

First, we need to load the data set using the library mlbench
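A sketch of loading and inspecting the data:

library(mlbench)
data(Glass)
str(Glass)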

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

This data frame has 214 observations of 10 variables. We can print the first few rows of the data frame with head(Glass).

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

Data visualization

To visualize the distributions of the predictor variables, we can use boxplots for the numerical variables.
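One way to do this is a sketch using tidyr and ggplot2 (both attached above):

library(tidyr)
library(ggplot2)

Glass |>
  pivot_longer(RI:Fe, names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = Type, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free_y")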

Overall Observations:

  • RI, K, Si, and Ba show a relatively narrow range across glass types.

  • Some elements, such as Na and Mg, exhibit greater variability.

A correlation heatmap can also be useful to understand the relationships between the predictor variables.
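For example, with the corrplot package loaded above:

library(corrplot)
corrplot(cor(Glass[, 1:9]))   # correlations among the nine numeric predictors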

There is a strong positive correlation between Ca and RI, while Si and RI, for instance, are negatively correlated.

Outliers and Skewness

Based on the boxplots for the Glass dataset, outliers are observed in several predictor variables like RI, Mg, Si, and Ca for Type 2 glass.

Let’s check for skewness using histograms
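A sketch of the histograms:

Glass |>
  pivot_longer(RI:Fe, names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")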

The predictor Mg is skewed to the left, while K, Ba, and Fe are skewed to the right.

Classification Model

  • A logarithmic transformation, for instance, can compress the range of values and reduce the impact of outliers; because several predictors (e.g., K, Ba, Fe) contain zeros, an offset such as log(x + 1) would be needed.

  • Similarly, a Box-Cox transformation can select an appropriate power transformation to make a skewed distribution more symmetric; note that it requires strictly positive values.

Exercise 8.1

Question:

Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers. Explain the differences among these figures. Do they all indicate that the data are white noise?

Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

Answer:

  1. The differences between these figures lie in the heights of the spikes and the width of the critical-value bands. As the number of random numbers increases (from 36 to 360 to 1,000), the sample autocorrelations shrink toward zero and the spikes become shorter. In all three plots the spikes stay within (or only barely cross) the bounds, so all three are consistent with white noise.

  2. The critical values lie at approximately \(\pm 1.96/\sqrt{T}\), where \(T\) is the series length, so they sit at different distances from zero: larger samples give more precise autocorrelation estimates and therefore tighter bounds. The autocorrelations differ across the figures because each is computed from a different random sample; for white noise they all estimate zero, with sampling variability that decreases as \(T\) grows.

Exercise 8.6

Question:

Use R to simulate and plot some data from simple ARIMA models.

Answer:

Generate the data for AR(1)
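A sketch following the simulation scheme suggested in the text, with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\):

set.seed(1)
y <- ts(numeric(100))
e <- rnorm(100)
for (i in 2:100)
  y[i] <- 0.6 * y[i - 1] + e[i]   # AR(1): y_t = phi1 * y_{t-1} + e_t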

Produce a time plot
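For example:

autoplot(y)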

To illustrate how the plot changes as we change \(\phi_1\), let's simulate and plot two scenarios with different values of \(\phi_1\).

Higher values of \(\phi_1\) lead to stronger autocorrelation and smoother patterns.

Generate data for MA(1)
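A sketch with \(\theta_1 = 0.6\):

y_ma <- ts(numeric(100))
e <- rnorm(100)
for (i in 2:100)
  y_ma[i] <- e[i] + 0.6 * e[i - 1]   # MA(1): y_t = e_t + theta1 * e_{t-1}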

Plot the data for MA(1)
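For example:

autoplot(y_ma)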

Changing \(\theta_1\) to -0.6

We observe that positive values of \(\theta_1\) create a smoother process with more persistence.

Generate and plot ARMA(1,1)
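A sketch with \(\phi_1 = 0.6\) and \(\theta_1 = 0.6\):

y_arma <- ts(numeric(100))
e <- rnorm(100)
for (i in 2:100)
  y_arma[i] <- 0.6 * y_arma[i - 1] + e[i] + 0.6 * e[i - 1]   # ARMA(1,1)
autoplot(y_arma)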

Generate and plot AR(2)
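A sketch with \(\phi_1 = -0.8\) and \(\phi_2 = 0.3\); these values violate the AR(2) stationarity conditions, so the simulated series oscillates with growing amplitude:

y_ar2 <- ts(numeric(100))
e <- rnorm(100)
for (i in 3:100)
  y_ar2[i] <- -0.8 * y_ar2[i - 1] + 0.3 * y_ar2[i - 2] + e[i]   # AR(2)
autoplot(y_ar2)   # non-stationary: the oscillations grow over time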