This assignment is to be submitted via email by the designated group representative and follows the Hyndman version of the problems assigned in batch #1. The first batch consists of 12 problems, drawn from both the KJ (Kuhn & Johnson) and HA (Hyndman & Athanasopoulos) texts. The problems to be addressed are as follows:
Completing these problems requires a thorough understanding of the material covered in the specified chapters.
Read the data into R
Select one of the time series as follows (but replace the column name with your own chosen column):
Explore your chosen retail time series using the following functions:
autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()
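A minimal sketch of these steps, assuming the retail.xlsx workbook from the textbook site and A3349873A as the chosen column:

```r
library(fpp2)

# Read the retail data; the header starts on the second row of the workbook
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)

# Monthly series beginning April 1982; substitute your own column name
myts <- ts(retaildata[, "A3349873A"], frequency = 12, start = c(1982, 4))

autoplot(myts)        # overall trend and seasonality
ggseasonplot(myts)    # year-over-year seasonal comparison
ggsubseriesplot(myts) # one mini-series per month
gglagplot(myts)       # series plotted against its own lags
ggAcf(myts)           # autocorrelation function
```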
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
The autoplot shows strong seasonality in the data, as well as an upward trend. Although there is a brief dip from 1990 to 2000, there is no evidence yet that this is part of a cycle.
The seasonal plot emphasizes the seasonality of the data. Sales start to rise in the fall before spiking sharply between November and December, then fall off after January, clearly coinciding with holiday shopping and Christmas sales.
The subseries plot again highlights the seasonality, but presents it more clearly than the seasonal plot. Although sales rise from September onward, the floor remains roughly the same; the only real difference is December, which has not only a higher ceiling but a higher floor as well.
The lag plots are not very readable for this series. Some negative and some positive relationships are visible, but the number of panels, and the fact that the data are monthly, make it difficult to discern much.
In the ACF, the slow decrease in the autocorrelations as the lag increases reflects the trend, while the scalloped shape reflects the seasonality of the sales data.
## time year quarter gilts profit s_per_herdsz production herdsz
## 1 1967.00 1967 1 105 8.075 10.80 2645 703
## 2 1967.25 1967 2 119 7.819 9.16 2540 722
## 3 1967.50 1967 3 119 7.366 9.38 2565 738
## 4 1967.75 1967 4 109 8.113 10.39 2776 747
## 5 1968.00 1968 1 117 7.380 9.44 2725 755
## 6 1968.25 1968 2 135 7.134 8.69 2623 780
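The ETS(A,N,N) report and the four-month forecasts below can be reproduced with fable; a sketch, assuming the Victorian pigs series from aus_livestock:

```r
library(fpp3)

# Simple exponential smoothing = ETS with additive errors, no trend, no season
fit <- aus_livestock %>%
  filter(Animal == "Pigs", State == "Victoria") %>%
  model(ses = ETS(Count ~ error("A") + trend("N") + season("N")))

report(fit)                # alpha and the initial level l[0]
fc <- forecast(fit, h = 4) # forecasts for the next four months
fc
```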
## Series: Count
## Model: ETS(A,N,N)
## Smoothing parameters:
## alpha = 0.3221247
##
## Initial states:
## l[0]
## 100646.6
##
## sigma^2: 87480760
##
## AIC AICc BIC
## 13737.10 13737.14 13750.07
## # A fable: 4 x 6 [1M]
## # Key: Animal, State, .model [1]
## Animal State .model Month Count .mean
## <fct> <fct> <chr> <mth> <dist> <dbl>
## 1 Pigs Victoria ses 2019 Jan N(95187, 8.7e+07) 95187.
## 2 Pigs Victoria ses 2019 Feb N(95187, 9.7e+07) 95187.
## 3 Pigs Victoria ses 2019 Mar N(95187, 1.1e+08) 95187.
## 4 Pigs Victoria ses 2019 Apr N(95187, 1.1e+08) 95187.
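The hand-computed interval that follows uses ŷ ± 1.96s, with s the standard deviation of the residuals; a sketch of the calculation, assuming the fit and fc objects from the sketch above:

```r
s <- sd(augment(fit)$.resid)   # standard deviation of the residuals
y_hat <- fc$.mean[1]           # first point forecast
c(lower = y_hat - 1.96 * s, upper = y_hat + 1.96 * s)

fc %>% hilo(95)                # R's own 95% intervals, for comparison
```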
## <distribution[2]>
## Lower Upper
## N(76871, 8.7e+07) N(113502, 8.7e+07)
The 95% prediction interval for the first forecast, computed by hand as ŷ ± 1.96s, runs from 76871 to 113502.
## <hilo[4]>
## [1] [76854.79, 113518.3]95 [75927.17, 114445.9]95 [75042.22, 115330.9]95
## [4] [74194.54, 116178.6]95
R’s interval for the first forecast, [76854.79, 113518.3], is nearly identical; the small difference arises because the hand calculation uses the standard deviation of the residuals, while R uses the model’s estimated forecast standard deviation.
## [1] 2724.992
Based on the forecasts and intervals above, the forecast from the custom SES function (2724.992) does not match the forecast (95187) produced by R’s ses() function for January 2019. The scale of the custom result suggests it may have been computed on a different pigs series (note the pigs data frame printed earlier, whose production values are near 2,700) rather than on the monthly slaughter counts.
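For reference, a minimal sketch of the kind of hand-rolled SES function the exercise asks for (the function name and arguments are illustrative):

```r
# Return the one-step-ahead SES forecast for series y,
# given smoothing parameter alpha and initial level l0
my_ses <- function(y, alpha, l0) {
  level <- l0
  for (t in seq_along(y)) {
    level <- alpha * y[t] + (1 - alpha) * level
  }
  level # the forecast of the next observation
}

# Example call; y must be the same pigs series passed to ses()
# my_ses(y = pigs, alpha = 0.3, l0 = pigs[1])
```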
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of alpha and l0. Do you get the same values as the ses() function?
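A sketch of the modified function and the optim() call; packing the parameters as c(alpha, l0) is a choice of this sketch:

```r
library(fpp2) # for the pigs series

# Sum of squared one-step-ahead errors for a given alpha and l0
ses_sse <- function(par, y) {
  alpha <- par[1]
  level <- par[2]
  sse <- 0
  for (t in seq_along(y)) {
    e <- y[t] - level # one-step-ahead forecast error
    sse <- sse + e^2
    level <- alpha * y[t] + (1 - alpha) * level
  }
  sse
}

# Minimize the SSE over alpha and l0, starting from rough guesses
opt <- optim(par = c(0.5, pigs[1]), fn = ses_sse, y = pigs)
opt$par # estimated c(alpha, l0)

# The ses() summary below can then be reproduced with, e.g.:
# summary(ses(y = pigs, initial = "simple", alpha = opt$par[1]))
```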
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = p, initial = "simple", alpha = optimal_alpha)
##
## Smoothing parameters:
## alpha = 0.8338
##
## Initial states:
## l = 94200
##
## sigma: 10462.3
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.885116 10462.3 7846.013 -0.8944317 9.093438 0.8469934
## ACF1
## Training set -0.3691641
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 93322.92 79914.94 106730.9 72817.19 113828.7
## Feb 2019 93322.92 75865.56 110780.3 66624.19 120021.7
## Mar 2019 93322.92 72592.62 114053.2 61618.66 125027.2
## Apr 2019 93322.92 69770.19 116875.6 57302.13 129343.7
## May 2019 93322.92 67251.54 119394.3 53450.19 133195.7
## Jun 2019 93322.92 64955.64 121690.2 49938.91 136706.9
## Jul 2019 93322.92 62832.13 123813.7 46691.28 139954.6
## Aug 2019 93322.92 60847.18 125798.7 43655.56 142990.3
## Sep 2019 93322.92 58976.75 127669.1 40794.98 145850.9
## Oct 2019 93322.92 57203.04 129442.8 38082.34 148563.5
## Optimal alpha (optim()): 0.8338
## Optimal l0 (optim()): 89414.95
These values were obtained by minimizing the sum of squared errors (SSE) of the SES model, so they are the parameters that best fit the historical data. The alpha values agree (0.8338) only because alpha was passed into ses() from the optim() result; the initial levels do not match, however, because ses() was called with initial = "simple", which sets the initial level to the first observation (94200) rather than optimizing it (optim() found l0 = 89414.95).
Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.
Step 1: Use auto.arima() to Find an Appropriate ARIMA Model
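A sketch of this step with the forecast package:

```r
library(fpp2) # loads forecast and the austa series

fit <- auto.arima(austa) # search over ARIMA orders by AICc
fit                      # selected model: ARIMA(0,1,1) with drift
```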
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 = 0.03376: log likelihood = 10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
Step 2: Check That the Residuals Look Like White Noise
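The residual diagnostics below are the kind produced by checkresiduals(), which plots the residuals, their ACF, and a histogram, and runs the Ljung-Box test; a sketch, reusing fit from Step 1:

```r
checkresiduals(fit)
```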
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
##
## Model df: 1. Total lags used: 7
Step 3: Plot Forecasts from the ARIMA Model for the Next 10 Periods
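One way to produce the forecast plot, again reusing fit:

```r
autoplot(forecast(fit, h = 10)) # 10-step-ahead forecasts with prediction intervals
```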
The ARIMA(0,1,1) model with drift seems to provide a reasonable fit to the austa series, with non-significant autocorrelation in the residuals and relatively low information criteria values.
The transformation is unhelpful because, while a Box-Cox transformation can stabilize the variance, it does not address features such as seasonality and trend. Techniques like differencing or seasonal decomposition would be required here, at least before applying Box-Cox.
Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
There are clear seasonal fluctuations, recurring at the same time each year, along with a trend-cycle.
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
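A sketch, assuming the plastics series (monthly sales of product A) from the fma package:

```r
library(fpp2) # loads fma (plastics) and forecast

decomp <- decompose(plastics, type = "multiplicative")
autoplot(decomp) # trend-cycle, seasonal indices, and remainder
```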
Do the results support the graphical interpretation from part a? Yes: the decomposition plot confirms both the seasonal fluctuations and the trend-cycle.
Compute and plot the seasonally adjusted data.
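For a multiplicative decomposition, the seasonally adjusted data divide out the seasonal component; a sketch, reusing decomp from above:

```r
autoplot(seasadj(decomp)) # equivalently, plastics / decomp$seasonal
```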
Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
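A sketch of injecting the outlier and re-running the decomposition; observation 30 is an arbitrary mid-series choice:

```r
plastics_out <- plastics
plastics_out[30] <- plastics_out[30] + 500 # add 500 to one observation

decomp_out <- decompose(plastics_out, type = "multiplicative")
autoplot(seasadj(decomp_out)) # compare with the unmodified version
```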
Does it make any difference if the outlier is near the end rather than in the middle of the time series? Yes, when more recent observations are weighted heavily: an outlier near the end distorts the most recent estimates, so predictive accuracy suffers more than when the outlier sits in the middle of the series.
Question:
Use the help function to explore what the series gold, woolyrnq and gas represent.
Answer:
Let’s use the help function to explore the data series ‘gold’, ‘woolyrnq’ and ‘gas’.
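```r
library(fpp2)  # the gold, woolyrnq, and gas data come with the forecast package

help(gold)     # or equivalently ?gold
help(woolyrnq)
help(gas)
```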
This data series represents the daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989.
This data series tracks the quarterly production of woollen yarn in Australia, in tonnes, from March 1965 to September 1994.
This data series tracks Australian monthly gas production from 1956 to 1995.
Here, we use autoplot() to plot each time series separately.
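```r
autoplot(gold)     # daily morning gold prices
autoplot(woolyrnq) # quarterly woollen yarn production
autoplot(gas)      # monthly gas production
```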
Observation:
Gold Price: The price of gold generally rises day over day until approximately day 770, where it experiences a significant spike. After this peak, the price trends downward until about day 1000.
Woolen Yarn Production: The production of woolen yarn exhibits a seasonal pattern, characterized by regular fluctuations with many ups and downs.
Gas Production: Gas production shows a seasonal pattern and a noticeable upward trend, particularly after 1970.
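The frequencies printed below come from the frequency() function:

```r
frequency(gold)
frequency(woolyrnq)
frequency(gas)
```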
## [1] 1
## [1] 4
## [1] 12
To summarize, here are the frequencies of the series:
gold: daily data, but frequency(gold) returns 1, since the series is stored without a seasonal period (a year of daily data would otherwise imply a frequency of about 365).
woolyrnq: quarterly data, so the frequency is 4.
gas: monthly data, so the frequency is 12.
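The outlier can be located with which.max(), as the exercise hints:

```r
which.max(gold)       # index of the largest observation
gold[which.max(gold)] # the gold price at that point
```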
## [1] 770
## [1] 593.7
The outlier in the gold series was the spike that occurred on day 770, with the gold price reaching 593.7 US dollars.
Question:
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Do there appear to be any outliers in the data? Are any predictors skewed?
Are there any relevant transformations of one or more predictors that might improve the classification model?
Answer:
First, we need to load the data set using the library mlbench
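A sketch of loading and inspecting the data:

```r
library(mlbench)
data(Glass)
str(Glass) # 214 observations: nine numeric predictors plus the Type factor
```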
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
This data frame has 214 observations of 10 variables. We can print the first six rows of the data frame with head().
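```r
head(Glass)
```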
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
To visualize the distributions of the predictor variables, we can draw boxplots of the numerical variables, as sketched below.
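A base-graphics sketch, drawing one boxplot per predictor grouped by glass type:

```r
par(mfrow = c(3, 3)) # one panel per predictor
for (p in names(Glass)[1:9]) {
  boxplot(Glass[[p]] ~ Glass$Type, main = p, xlab = "Type", ylab = p)
}
```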
Overall observations:
RI, K, Si, and Ba show a relatively narrow range across glass types.
Some elements, such as Na and Mg, exhibit greater variability.
A correlation heatmap can also be useful to understand the relationships between the predictor variables.
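A sketch using the corrplot package:

```r
library(corrplot)
corrplot(cor(Glass[, 1:9]), method = "color") # pairwise predictor correlations
```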
There is a strong positive correlation between Ca and RI, while Si and RI, for instance, are negatively correlated.
Based on the boxplots for the Glass dataset, outliers are observed in several predictor variables like RI, Mg, Si, and Ca for Type 2 glass.
Let’s check for skewness using histograms:
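```r
par(mfrow = c(3, 3)) # one histogram per predictor
for (p in names(Glass)[1:9]) {
  hist(Glass[[p]], main = p, xlab = p)
}
```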
The predictor Mg is skewed to the left, while K, Ba, and Fe are skewed to the right.
We can use a logarithmic transformation, for instance, to compress the range of values and reduce the impact of outliers. Similarly, a Box-Cox transformation can choose an appropriate power transformation to make a skewed distribution more nearly normal. A sketch of one option follows.
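A sketch via caret’s preProcess(); Yeo-Johnson is used here instead of Box-Cox because K, Ba, and Fe contain zeros, which Box-Cox cannot handle:

```r
library(caret)
pp <- preProcess(Glass[, 1:9], method = "YeoJohnson")
Glass_trans <- predict(pp, Glass[, 1:9]) # transformed predictors
```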
Question:
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers. Explain the differences among these figures. Do they all indicate that the data are white noise?
Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.
Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
Answer:
The differences between these figures lie in the heights of the spikes and the width of the critical-value bounds. As the number of random numbers increases (from 36 to 360 to 1,000), the spikes in the ACF plots become shorter and the bounds narrower. In every case the autocorrelations stay within the bounds and show no systematic pattern, so all three figures indicate white noise.
The critical values sit at ±1.96/√T, where T is the length of the series, so they move closer to zero as the sample size grows: autocorrelation estimates become more precise with more data. The autocorrelations differ across the figures because each is computed from a different random sample, and estimates from shorter series vary more around zero.
Answer:
To illustrate how the plot changes as we vary phi1, let’s simulate and plot two scenarios with different values of phi1, as sketched below.
Higher values of phi1 lead to stronger autocorrelation and smoother patterns.
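A simulation sketch; the helper name and the two phi1 values are illustrative:

```r
library(fpp2) # for autoplot()
set.seed(123)

# Simulate an AR(1) process: y_t = phi1 * y_{t-1} + e_t
simulate_ar1 <- function(phi1, n = 100) {
  e <- rnorm(n)
  y <- numeric(n)
  y[1] <- e[1]
  for (t in 2:n) y[t] <- phi1 * y[t - 1] + e[t]
  ts(y)
}

autoplot(simulate_ar1(0.1)) # close to white noise
autoplot(simulate_ar1(0.9)) # strongly autocorrelated and smoother
```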
Changing theta1 to -0.6 flips the sign of the lag-1 autocorrelation.
We observe that positive values of theta1 create a smoother process with more persistence, while negative values produce a choppier series that alternates around its mean. A simulation sketch follows.
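A matching MA(1) sketch (helper name illustrative):

```r
set.seed(123)

# Simulate an MA(1) process: y_t = e_t + theta1 * e_{t-1}
simulate_ma1 <- function(theta1, n = 100) {
  e <- rnorm(n + 1)
  ts(e[2:(n + 1)] + theta1 * e[1:n])
}

autoplot(simulate_ma1(0.6))  # positive theta1: smoother, more persistent
autoplot(simulate_ma1(-0.6)) # negative theta1: choppier, alternating
```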
…