Chapter 2

2.1. Use the help function to explore what the series gold, woolyrnq and gas represent.

The gold, woolyrnq, and gas datasets are available in the forecast package in R. Do the following:
a. Use autoplot() to plot each of these in separate plots.
b. What is the frequency of each series? (Hint: Apply the frequency() function.)
c. Use which.max() to spot the outlier in the gold series. Which observation was it?

Approach:
For this exercise, we need to analyze three time series datasets: gold (daily gold prices), woolyrnq (quarterly wool yarn production), and gas (monthly Australian gas production). The goal is to visualize each series, determine their frequencies, and identify an outlier in the gold series. My approach is as follows:
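A minimal sketch of the commands behind the output below (assuming the fpp2 package, which loads forecast and these three series):

library(fpp2)   # provides gold, woolyrnq, and gas

str(gold); str(woolyrnq); str(gas)   # structure shown below

# a. Plot each series in a separate plot
autoplot(gold)
autoplot(woolyrnq)
autoplot(gas)

# b. Frequency of each series
frequency(gold)      # 1
frequency(woolyrnq)  # 4 (quarterly)
frequency(gas)       # 12 (monthly)

# c. Locate the outlier in the gold series
which.max(gold)       # index of the maximum observation
gold[which.max(gold)] # its value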

##  Time-Series [1:1108] from 1 to 1108: 306 300 303 297 304 ...
##  Time-Series [1:119] from 1965 to 1994: 6172 6709 6633 6660 6786 ...
##  Time-Series [1:476] from 1956 to 1996: 1709 1646 1794 1878 2173 ...

## [1] "Frequency"
## [1] "gold"
## [1] 1
## [1] "woolyrnq"
## [1] 4
## [1] "gas"
## [1] 12
## [1] "When gold got maximum value?"
## [1] 770
## [1] "What was the gold's maximum value?"
## [1] 593.7

Interpretation:

The frequency() results show that gold is stored with frequency 1 (one observation per period, so no seasonal index), woolyrnq is quarterly (frequency 4), and gas is monthly (frequency 12). which.max() flags observation 770 as the outlier in the gold series: a spike to 593.7 that stands well above the neighbouring daily prices.

2.3. Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.
a. Read the data into R using the provided script.
b. Select one of the time series by extracting a column and converting it to a time series object.
c. Explore your chosen retail time series using the functions autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), and ggAcf(). Can you spot any seasonality, cyclicity, and trend? What do you learn about the series?

Approach:
This exercise involves loading and analyzing a monthly retail sales dataset from an Excel file. We’ll read the data, select a specific time series (one column representing sales in a category and state), and use visualization tools from the forecast package to examine its patterns. My approach is outlined below, with a code sketch following the descriptions:

  1. Data Structure (from View(retaildata)):

    • A data frame with columns like “A3349873A” (e.g., clothing sales in Victoria), starting April 1982, with hundreds of rows (monthly data up to, say, 2010).
  2. Plots (Descriptions):

    • autoplot(myts): A line plot showing retail turnover over time, likely with an upward trend and seasonal spikes.

    • ggseasonplot(myts): A plot with 12 lines (one per month) across years, showing higher sales in December (e.g., Christmas).

    • ggsubseriesplot(myts): 12 subplots (one per month), each with a mini time series, highlighting December peaks and a general upward trend.

    • gglagplot(myts, lags = 12): Scatterplots for lags 1–12, with a strong linear pattern at lag 12 (yearly seasonality).

    • ggAcf(myts): Bar chart of autocorrelation, with significant spikes at lags 12, 24, etc., and a gradual decline (trend effect).

  3. Sample Data Point:

    • myts might start like: [1982-04] 120.5, [1982-05] 115.2, …, [1982-12] 180.3 (in millions of AUD, increasing over years).
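The sketch below reproduces these steps, using the loading script the book provides (it assumes retail.xlsx has been downloaded into the working directory; A3349873A is one example column):

# a. Read the data (the book's script)
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)

# b. Extract one column and convert it to a monthly time series
myts <- ts(retaildata[, "A3349873A"], frequency = 12, start = c(1982, 4))

# c. Exploratory plots
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts, lags = 12)
ggAcf(myts)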

Interpretation:

What I Learned: The series exhibits a clear upward trend (retail growth over time), strong seasonality (yearly cycles, especially holiday-driven), and no obvious cyclicity beyond seasonal effects. The autocorrelation confirms these patterns persist consistently, making this series a good candidate for seasonal forecasting models (e.g., ARIMA with seasonal components) later in the course.

Chapter 6

6.2. The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

a. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
b. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
c. Do the results support the graphical interpretation from part (a)?
d. Compute and plot the seasonally adjusted data.
e. Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Approach:
This exercise involves analyzing the plastics time series to identify its components (trend, seasonality, and irregular fluctuations) using classical decomposition. We’ll visualize the data, decompose it multiplicatively (since sales often grow proportionally), assess seasonally adjusted data, and explore the impact of outliers at different positions. My approach is:
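A minimal sketch of these steps (the plastics series ships with fpp2; the outlier position, observation 30, is an arbitrary choice for illustration):

# a. Plot the series
autoplot(plastics)

# b. Classical multiplicative decomposition
decomp <- decompose(plastics, type = "multiplicative")
autoplot(decomp)

# d. Seasonally adjusted data
autoplot(seasadj(decomp))

# e. Inject an outlier mid-series and recompute
plastics_out <- plastics
plastics_out[30] <- plastics_out[30] + 500
autoplot(seasadj(decompose(plastics_out, type = "multiplicative")))

# f. Repeat with the outlier near the end (e.g., observation 58) and compare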

Chapter 3 - KJ

Assignment 3 includes problems 3.1 and 3.2 from the KJ text.

The following R packages have been used for completion of all homework assignments to date:

#Textbook Packages
library(fpp2)
#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
#install.packages("mlbench")
library(mlbench)

#Processing
library(tidyverse)

#Graphing
library(ggplot2)
library(gridExtra)

#Math
library(caret)
library(randomForest)
library(seasonal)
library(psych)
library(corrplot)

#Formatting
library(knitr)
library(kableExtra)

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

## 'data.frame': 214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Approach:
This exercise involves exploratory data analysis (EDA) of the Glass dataset’s predictors to understand their distributions, relationships, outliers, skewness, and potential transformations for classification modeling. My approach is:
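A sketch of the EDA commands behind the tables and plots that follow (describe() is from psych, corrplot() from corrplot, both loaded above):

data(Glass)
predictors <- Glass[, -10]   # drop the Type factor

# Summary statistics (table below)
describe(predictors)

# Histograms of each predictor
predictors %>%
  gather(key = "var", value = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ var, scales = "free")

# Pairwise correlations
corrplot(cor(predictors))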

(a). Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Summary Statistics

We can quickly evaluate the predictor variables using summary statistics.

Summary Statistics of Glass data

| var | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|-----|------|-----|---------|--------|---------|---------|--------|---------|---------|---------|---------|----------|--------|
| RI | 1 | 214 | 1.5184 | 0.0030 | 1.5177 | 1.5180 | 0.0019 | 1.5112 | 1.5339 | 0.0228 | 1.6027 | 4.7167 | 0.0002 |
| Na | 2 | 214 | 13.4079 | 0.8166 | 13.3000 | 13.3768 | 0.6449 | 10.7300 | 17.3800 | 6.6500 | 0.4478 | 2.8980 | 0.0558 |
| Mg | 3 | 214 | 2.6845 | 1.4424 | 3.4800 | 2.8655 | 0.3039 | 0.0000 | 4.4900 | 4.4900 | -1.1365 | -0.4527 | 0.0986 |
| Al | 4 | 214 | 1.4449 | 0.4993 | 1.3600 | 1.4122 | 0.3113 | 0.2900 | 3.5000 | 3.2100 | 0.8946 | 1.9384 | 0.0341 |
| Si | 5 | 214 | 72.6509 | 0.7745 | 72.7900 | 72.7073 | 0.5708 | 69.8100 | 75.4100 | 5.6000 | -0.7202 | 2.8164 | 0.0529 |
| K | 6 | 214 | 0.4971 | 0.6522 | 0.5550 | 0.4318 | 0.1705 | 0.0000 | 6.2100 | 6.2100 | 6.4601 | 52.8665 | 0.0446 |
| Ca | 7 | 214 | 8.9570 | 1.4232 | 8.6000 | 8.7422 | 0.6598 | 5.4300 | 16.1900 | 10.7600 | 2.0184 | 6.4104 | 0.0973 |
| Ba | 8 | 214 | 0.1750 | 0.4972 | 0.0000 | 0.0338 | 0.0000 | 0.0000 | 3.1500 | 3.1500 | 3.3687 | 12.0801 | 0.0340 |
| Fe | 9 | 214 | 0.0570 | 0.0974 | 0.0000 | 0.0358 | 0.0000 | 0.0000 | 0.5100 | 0.5100 | 1.7298 | 2.5204 | 0.0067 |

Distribution

The histograms below show the distribution of each predictor variable and make the skewness from the summary statistics visible. Mg and Si are both skewed to the left, whereas the other variables have means greater than their medians, consistent with right-skewed distributions.

The plots for RI, Na, Al, and Si are relatively symmetric, suggesting normal distributions. The remainder show asymmetric distributions with either unimodal or bimodal peaks.

Relationship

We can evaluate the relationships between our predictor variables by examining their pairwise correlations. The plot below indicates that the strongest positive relationship is between Ca and RI, and the strongest negative relationship is between Si and RI.

(b). Do there appear to be any outliers in the data? Are any predictors skewed?

Outliers

We can use boxplots to check whether the data contain outliers. All predictors except Mg show values beyond the whiskers (1.5 × the interquartile range). The plot for Ba shows the largest number of outliers of any predictor.

Skewness

The small size of our dataset (214 observations per variable) may inflate the apparent skewness. As noted earlier, Mg and Si are skewed left, whereas all other variables are skewed right. The degree of skewness is shown in the table below:

| var | skew |
|-----|---------|
| Mg | -1.1365 |
| Si | -0.7202 |
| Na | 0.4478 |
| Al | 0.8946 |
| RI | 1.6027 |
| Fe | 1.7298 |
| Ca | 2.0184 |
| Ba | 3.3687 |
| K | 6.4601 |

(c). Are there any relevant transformations of one or more predictors that might improve the classification model?

Box-Cox Transformation

We can apply a Box-Cox transformation to the predictors to reduce skewness, using the forecast package. The Box-Cox method uses maximum likelihood estimation to determine the transformation parameter \(\lambda\), and transforming skewed predictors in this way can improve the overall classification model. Additional methods (e.g., the spatial sign transformation) could be applied to reduce the influence of outliers if we find that our predictive models are not resistant to their effect. However, given our small sample size, we need more information about how the samples were collected to understand the outlying data points.
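A sketch of this step, assuming forecast's BoxCox()/BoxCox.lambda() interface as described above (the lambdas actually used may differ; note that Box-Cox strictly assumes positive data, so columns containing zeros are only well defined for \(\lambda > 0\)):

# Estimate lambda per predictor and transform
Glass_bc <- as.data.frame(
  lapply(predictors, function(x) BoxCox(x, BoxCox.lambda(x)))
)

# Skewness before and after (compare with the table below)
data.frame(pre  = describe(predictors)$skew,
           post = describe(Glass_bc)$skew,
           row.names = names(predictors))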

Visualize Transformation

Skew Evaluation

The table below shows the degree to which the skewness of each variable changed as a result of applying the Box-Cox transformation.

Comparison of Skewness in Glass Variables: Pre and Post Transformation

| Variable | Pre-T Skew | Post-T Skew | Change |
|----------|-----------|------------|--------|
| RI | 1.6027 | 1.5903 | 0.0124 |
| Na | 0.4478 | 0.0339 | 0.4140 |
| Mg | -1.1365 | -1.1365 | 0.0000 |
| Al | 0.8946 | -0.1506 | 1.0452 |
| Si | -0.7202 | -0.7888 | 0.0686 |
| K | 6.4601 | -1.9759 | 8.4360 |
| Ca | 2.0184 | 1.0515 | 0.9669 |
| Ba | 3.3687 | 1.7123 | 1.6563 |
| Fe | 1.7298 | 0.7594 | 0.9704 |

  1. Part a - Summary Statistics (Glass_desc):

    • Example: RI: mean = 1.518, sd = 0.003, min = 1.511, max = 1.533; Ba: mean = 0.175, sd = 0.497, min = 0, max = 3.15.
  2. Part a - Histograms:

    • RI: Near-normal, slight right skew.

    • Mg: Left-skewed (many high values, few low).

    • Ba: Right-skewed, spike at 0, long tail.

  3. Part a - Correlation Plot:

    • Strongest positive: RI vs. Ca (~0.81).

    • Strongest negative: RI vs. Si (~-0.54).

  4. Part b - Boxplots:

    • Ba: Many outliers above 1.5 (e.g., 3.15).

    • Mg: No outliers, tight range.

  5. Part b - Skewness Table:

    • Mg: -1.14 (left); Ba: 3.37 (right); RI: 1.60 (right).
  6. Part c - Transformed Histograms:

    • Ba: Less skewed, peak shifts left.

    • Mg: Essentially unchanged.

  7. Part c - Skewness Comparison:

    • Ba: Pre = 3.37, Post = 1.71, Change = 1.66.

    • Mg: Pre = -1.14, Post = -1.14, Change = 0.00 (effectively unchanged).

Interpretation:

  • Part a (Distributions and Relationships):

    • Summary Statistics: The describe() function reveals variability (e.g., Ba has a high sd relative to its mean, indicating spread). RI and Si have tight ranges, while Ba and Fe are sparse (many zeros).

    • Histograms: RI, Na, Al, and Si appear roughly symmetric, suggesting near-normal distributions, though RI has a slight right tail. Mg is left-skewed (high values dominate), while K, Ca, Ba, and Fe are right-skewed, with Ba showing a bimodal pattern (peak at 0, smaller peak > 1). This informs potential transformations.

    • Correlation: The corrplot highlights RI vs. Ca (positive, ~0.81) due to chemical similarity in glass composition, and RI vs. Si (negative, ~-0.54) as silicon reduces refractive index. Most other correlations are weak, suggesting limited multicollinearity.

  • Part b (Outliers and Skewness):

    • Outliers: Boxplots show outliers in all predictors except Mg. Ba has the most (e.g., values > 1.5), reflecting rare high concentrations. RI has few, given its tight range. Outliers may affect classification if not robustly handled.

    • Skewness: Mg (-1.14) and Si (-0.72) are left-skewed; the others are right-skewed (e.g., Ba 3.37, K 6.46). The small sample size (214) makes these estimates noisy, but such asymmetries could bias models like logistic regression unless addressed.

  • Part c (Transformations):

    • Box-Cox: The transformation reduces skewness by finding an optimal λ for each predictor (e.g., λ<1 for right-skewed, λ>1 for left-skewed). Post-transformation histograms show Ba and K becoming less right-skewed (K in fact overshoots into left skew), while Mg is essentially unchanged.

    • Skewness Reduction: Ba drops from 3.37 to 1.71 (change = 1.66), K from 6.46 to -1.98, and Fe from 1.73 to 0.76. This normalization can improve classification models (e.g., LDA, SVM) that assume normality or are sensitive to scale.

    • Insight: Box-Cox is effective here, but outliers remain (transformations don’t remove them). For a small dataset, robust methods (e.g., spatial sign) or outlier removal might further enhance model performance, depending on their source (error vs. true variation).

    Insight: The predictors’ chemical nature drives their distributions (e.g., Ba rarity causes skewness). Transformations align the data with statistical assumptions, but the small sample and class imbalance (7 categories) suggest caution—cross-validation will be key to confirm model improvement.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

Approach:
This exercise involves exploratory data analysis (EDA) of the Soybean dataset’s categorical predictors and missing data patterns to inform modeling strategies. My approach is:

(a). Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The bar plots below help us identify categorical predictors with degenerate distributions. A number of variables have what the text describes as a “handful of unique values that occur with very low frequencies.” These occurrences can be problematic for certain types of models (e.g., linear regression), as they distort the calculations within the model. In some instances, such as ext.decay, we could convert the 3 factor levels to a binary indicator to offset the observed imbalance. Other predictors, like sclerotia and mycelium, could be dropped entirely, as they are overwhelmingly dominated by a single value, with only a few other or missing observations.
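caret's nearZeroVar() can flag these degenerate predictors programmatically; a minimal sketch (Soybean comes from mlbench, loaded above):

data(Soybean)

# freqRatio and percentUnique quantify how imbalanced each predictor is;
# nzv == TRUE marks near-zero-variance columns such as mycelium and sclerotia
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]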

(b). Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

(c). Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Because missing data are limited to only five classes, it is important for us to understand why the data are missing. If the missing observations are due to structural reasons, the NA values should be retained. For example, these specific classes might not bear fruit, which would account for NA observations for fruit bodies, pods, and seeds. Imputing these observations, especially where 100% of the data are unaccounted for, could severely bias our model if we do not understand the rationale behind this clear pattern of NA values. Given the size of our data, elimination of predictors would not be ideal. Each variable has less than 18% of its data missing, which could be handled by a KNN or mode imputation approach. However, I would only recommend that type of imputation for the date and area.dam variables, as they are missing at most 6.25% within individual classes.

The other variables contain much higher proportions of missing data per class, ranging from 40 to 100 percent. For this reason, I would recommend a tree-based modeling approach, as these models can handle missing data natively, unlike most regression models.
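A sketch of the missingness audit behind this discussion (dplyr syntax, with Soybean loaded as above):

# Overall NA share by predictor
colMeans(is.na(Soybean))

# NA share per predictor within each class; keep only classes with NAs
Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  filter(if_any(-Class, ~ .x > 0))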

Interpretation:

  • Part a (Frequency Distributions):

    • The bar plots reveal several predictors with degenerate distributions. ext.decay has three levels, but one dominates (>90% of cases), with the others rare, fitting the chapter’s “handful of unique values with low frequencies” description. sclerotia and mycelium are nearly constant (e.g., “absent” in ~98% of cases), making them degenerate in the sense of lacking variation. These imbalances can destabilize models like linear regression, which assume meaningful variation, or overfit tree-based models to rare events. Converting ext.decay to binary (e.g., present/absent) or dropping sclerotia and mycelium could simplify analysis without losing much signal.
  • Part b (Missing Data):

    • Predictors: About 18% of the data (across all cells) is missing. Only Class and leaves are completely observed; date and area.dam are nearly complete. Predictors like fruiting.bodies (18%), seed.size (15%), and fruit.pods (>10%) are most affected, tied to plant condition rather than environment (e.g., temp has fewer NAs).

    • Class Patterns: Missingness is confined to 5 of 19 classes (phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury). Within these, some predictors are 100% NA for a class, suggesting structural missingness (disease-specific traits absent). This non-random pattern implies missingness is tied to class biology, not random error.

  • Part c (Strategy):

    • Analysis: Eliminating predictors risks losing valuable signal, as even high-NA variables (e.g., fruiting.bodies) vary across classes. Imputation (e.g., KNN, mode) suits low-NA predictors like date (up to 6% per class), but for predictors with 40–100% NA in affected classes (e.g., fruit.pods), imputation could introduce severe bias—imputing zeros for absent fruit in non-fruiting diseases misrepresents reality.

    • Recommendation: Retain NAs where structurally meaningful (e.g., no fruit in certain diseases) and use tree-based models (e.g., random forests), which handle missing data natively via surrogate splits. This avoids imputation bias and leverages the dataset’s full structure. For small-NA predictors, mode imputation could supplement if needed, but the tree approach is robust given the 683-sample size and categorical nature.

    • Insight: The degeneracy and missingness reflect soybean biology—rare conditions (e.g., mycelium) and disease-specific absences (e.g., fruit.pods) are informative, not noise. Tree models align with this complexity, preserving interpretability over forced imputation. Cross-validation will confirm if retaining NAs outperforms elimination or imputation.

Chapter 7

7.1. Consider the pigs series, the number of pigs slaughtered in Victoria each month.

a. Use the ses() function in R to find the optimal values of \(\alpha\) and \(\ell_0\), and generate forecasts for the next four months.

b. Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.

Approach:
This exercise involves applying Simple Exponential Smoothing (SES) to the pigs time series to forecast future values and assess uncertainty. My approach is:
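A sketch of the fit and the hand-computed interval compared below:

# a. Optimal alpha and l0 via ses(), with four months of forecasts
fc <- ses(pigs, h = 4)
summary(fc)

# b. Hand-computed 95% interval for the first forecast
s <- sd(residuals(fc))
fc$mean[1] + 1.96 * s   # upper bound
fc$mean[1] - 1.96 * s   # lower bound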

##  Time-Series [1:188] from 1980 to 1996: 76378 71947 33873 96428 105084 ...
##         Jan    Feb    Mar    Apr    May    Jun
## 1980  76378  71947  33873  96428 105084  95741
## Simple exponential smoothing 
## 
## Call:
## ses(y = pigs, h = 4)
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665
##      95% 
## 119020.8
##      95% 
## 78611.97
## [1] 118952.8
## [1] 78679.97

Interpretation:

The optimal parameters are \(\alpha = 0.2971\) and \(\ell_0 = 77260.06\), giving a first forecast of 98816.41. The hand-computed 95% interval, 98816.41 ± 1.96s = (78679.97, 118952.8), is very close to R's interval of (78611.97, 119020.8). The small discrepancy arises because R bases its interval on the model's estimated sigma (10308.58) rather than on the raw standard deviation of the residuals, so R's interval is slightly wider.

7.2. Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter \(\alpha\)) and level (the initial level \(\ell_0\)). It should return the forecast of the next observation in the series. Does it give the same forecast as ses()?

    Approach:
    I need to create a custom SES function that mimics the mechanics of R’s ses() function, then test it against ses() using two datasets. Here’s how I’ll tackle it (a code sketch follows the list):

    • For the function, I’ll use the SES recursion \(\ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}\), whose final level is the one-step-ahead forecast.

    • I’ll extract alpha and \(\ell_0\) from ses() fits on pigs (from 7.1 above) and ausbeer (quarterly beer production), run my function, and compare the forecasts.

    • I’ll interpret whether my function matches ses() and reflect on any differences.
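A sketch of the function (my_ses is my name for it, not from the text):

my_ses <- function(y, alpha, level) {
  l <- level
  for (t in seq_along(y)) {
    l <- alpha * y[t] + (1 - alpha) * l   # level update: l_t = alpha*y_t + (1 - alpha)*l_{t-1}
  }
  l   # the one-step-ahead forecast
}

# Compare against ses() on pigs
fit <- ses(pigs, h = 1)
my_ses(pigs, alpha = fit$model$par["alpha"], level = fit$model$par["l"])
fit$mean[1]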

## Forecast of next observation by SES function: 
## 98816.4061115907
##    alpha 
## 98816.41
## Forecast of next observation by ses function:  98816.4061115907
## Forecast of next observation by SES function: 
## 421.313649201742
##    alpha 
## 421.3136
## Forecast of next observation by ses function:  421.313649201742

Discussion of Results:

I learned SES is straightforward to code manually, but its simplicity limits it. For pigs or ausbeer, a visual check (e.g., autoplot(pigs)) might reveal trends or seasonality SES misses. My function’s success here validates the math, but real-world forecasting might need Holt’s or Holt-Winters methods instead.

7.3. Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(\ell_0\). Do you get the same values as the ses() function?

    Approach:
    I need to tweak my SES function to compute SSE instead of a forecast, then use optim() to minimize SSE and find the best \(\alpha\) and \(\ell_0\). Here’s how I’ll do it (a code sketch follows the list):

    • I’ll modify SES to take a parameter vector (pars = c(alpha, l0)) and the time series y, calculate errors as I smooth the series, and return SSE.

    • I’ll use optim() with initial guesses (\(\alpha = 0.5\), \(\ell_0 = y[1]\)) to minimize SSE for pigs and ausbeer.

    • I’ll compare my optimized parameters to those from ses_pigs and ses_ausbeer (from prior exercises) and figure out why differences might occur.
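A sketch of the SSE version and the optim() call (ses_sse is my name for it):

ses_sse <- function(pars, y) {
  alpha <- pars[1]
  l     <- pars[2]
  sse   <- 0
  for (t in seq_along(y)) {
    e   <- y[t] - l            # one-step-ahead forecast error
    sse <- sse + e^2
    l   <- alpha * y[t] + (1 - alpha) * l
  }
  sse
}

opt <- optim(par = c(0.5, pigs[1]), fn = ses_sse, y = pigs)
opt$par   # compare with ses()'s alpha and l0 (printed below)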

## Optimal parameters for the result of SES function: 
## 0.299008094014243, 76379.2653476235
## Parameters got from the result of ses function: 
## 0.297148833372095, 77260.0561458528
## Optimal parameters for the result of SES function: 
## 0.148803623284766, 259.658459712619
## Parameters got from the result of ses function: 
## 0.148900381680056, 259.65918938777

7.5. Data set books contains the daily sales of paperback and hardcover books at the same store. The task is to forecast the next four days’ sales for paperback and hardcover books.
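
A sketch of the steps behind the output that follows (SES on each column; the final two values are the training RMSEs printed below):

str(books)
head(books)
autoplot(books)   # both series trend upward with day-to-day noise

fc_pb <- ses(books[, "Paperback"], h = 4)
fc_hc <- ses(books[, "Hardcover"], h = 4)

sqrt(mean(residuals(fc_pb)^2))   # SES RMSE, paperback
sqrt(mean(residuals(fc_hc)^2))   # SES RMSE, hardcover
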
##  Time-Series [1:30, 1:2] from 1 to 30: 199 172 111 209 161 119 195 195 131 183 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "Paperback" "Hardcover"
## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   Paperback Hardcover
## 1       199       139
## 2       172       128
## 3       111       172
## 4       209       139
## 5       161       191
## 6       119       168

## [1] 33.63769
## [1] 31.93101

7.6. We will continue with the daily sales of paperback and hardcover books in data set books.
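
A sketch of the Holt fits and the interval comparison printed below:

fc_pb <- holt(books[, "Paperback"], h = 4)
fc_hc <- holt(books[, "Hardcover"], h = 4)

sqrt(mean(residuals(fc_pb)^2))   # Holt RMSE, paperback
sqrt(mean(residuals(fc_hc)^2))   # Holt RMSE, hardcover

# R's 95% interval for the first forecast vs. the mean +/- 1.96 * RMSE rule
fc_pb$upper[1, "95%"]; fc_pb$lower[1, "95%"]
fc_pb$mean[1] + 1.96 * sqrt(mean(residuals(fc_pb)^2))
fc_pb$mean[1] - 1.96 * sqrt(mean(residuals(fc_pb)^2))
fc_hc$upper[1, "95%"]; fc_hc$lower[1, "95%"]
fc_hc$mean[1] + 1.96 * sqrt(mean(residuals(fc_hc)^2))
fc_hc$mean[1] - 1.96 * sqrt(mean(residuals(fc_hc)^2))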

## [1] 31.13692
## [1] 27.19358
## 95% PI of paperback sales calculated by holt function
##      95% 
## 275.0205
##     95% 
## 143.913
## 95% PI of paperback sales calculated by formula
## [1] 270.4951
## [1] 148.4384
## 95% PI of hardcover sales calculated by holt function
##      95% 
## 307.4256
##      95% 
## 192.9222
## 95% PI of hardcover sales calculated by formula
## [1] 303.4733
## [1] 196.8745

7.10. For this exercise use data set ukcars, the quarterly UK passenger vehicle production data from 1977Q1-2005Q1.
##  Time-Series [1:113] from 1977 to 2005: 330 371 271 344 358 ...
##         Qtr1    Qtr2    Qtr3    Qtr4
## 1977 330.371 371.051 270.670 343.880
## 1978 358.491 362.822
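
A sketch of the models compared below: STL decomposition with damped and plain Holt on the seasonally adjusted series (via stlf()), against a plain ets() fit:

fit_ets       <- ets(ukcars)   # selects ETS(A,N,A), summarized below
fc_stl_damped <- stlf(ukcars, etsmodel = "AAN", damped = TRUE)
fc_stl_holt   <- stlf(ukcars, etsmodel = "AAN", damped = FALSE)

accuracy(fc_stl_damped)
accuracy(fc_stl_holt)
accuracy(forecast(fit_ets))

checkresiduals(fc_stl_damped)   # Ljung-Box test printed below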

## ETS(A,N,A) 
## 
## Call:
## ets(y = ukcars)
## 
##   Smoothing parameters:
##     alpha = 0.6199 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 314.2568 
##     s = -1.7579 -44.9601 21.1956 25.5223
## 
##   sigma:  25.9302
## 
##      AIC     AICc      BIC 
## 1277.752 1278.819 1296.844 
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
##                    ACF1
## Training set 0.02573334

## [1] "Accuracy of STL + ETS(A, Ad, N) model"
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.040088 22.81499 17.73846 -0.1296296 5.821059 0.5780878
##                    ACF1
## Training set 0.01590553
## [1] "Accuracy of STL + ETS(A, A, N) model"
##                      ME     RMSE    MAE        MPE     MAPE      MASE
## Training set -0.4024211 22.86408 17.732 -0.5948215 5.836029 0.5778774
##                    ACF1
## Training set 0.02416541
## [1] "Accuracy of ETS(A, N, A) model"
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
##                    ACF1
## Training set 0.02573334

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,Ad,N)
## Q* = 23.825, df = 8, p-value = 0.002451
## 
## Model df: 0.   Total lags used: 8

Chapter 8

8.1. Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

a. Explain the differences among these figures. Do they all indicate that the data are white noise?

    All three plots indicate white noise, but with different visual characteristics:

    • Small sample (n = 36): larger random fluctuations in the autocorrelations.

    • Medium sample (n = 360): moderate random fluctuations.

    • Large sample (n = 1000): smaller random fluctuations, closer to zero.

b. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

    Critical values: The boundaries are at ±1.96/√n, so they’re wider for smaller samples (36) and narrower for larger samples (1000).

    Different autocorrelations: These arise from sampling variability. Smaller samples have higher variance in their autocorrelation estimates, while larger samples produce more precise estimates closer to zero.
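A quick way to reproduce figures like 8.31 is to plot ACFs of simulated white noise at the three sample sizes:

set.seed(30)
ggAcf(rnorm(36))
ggAcf(rnorm(360))
ggAcf(rnorm(1000))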

8.2. A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
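A sketch of the three plots the question asks for. The expected patterns: the price series wanders with no fixed level; the ACF decays very slowly from values near one (a trend signature); and the PACF shows a single dominant spike close to one at lag 1, all pointing to first differencing:

autoplot(ibmclose)
ggAcf(ibmclose)
ggPacf(ibmclose)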

8.7. Consider the number of women murdered each year (per 100,000 standard population) in the United States. (Data set wmurders.)

a. By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model.
b. Should you include a constant in the model? Explain.
c. Write this model in terms of the backshift operator.
d. Fit the model using R and examine the residuals. Is the model satisfactory?
e. Forecast three times ahead. Check your forecasts by hand to ensure you understand their calculation.
f. Create a plot of the series with forecasts and prediction intervals for the next three periods.
g. Does auto.arima() give the same model you chose? If not, which model is better?
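
A sketch of the model-building steps behind the output below (kpss.test() is from the tseries package, which this assumes is installed):

library(tseries)

ndiffs(wmurders)                              # suggests d = 2
kpss.test(diff(wmurders, differences = 2))    # twice-differenced series is stationary

fit <- Arima(wmurders, order = c(0, 2, 2))    # chosen model: ARIMA(0,2,2), no constant
checkresiduals(fit)
forecast(fit, h = 3)                          # three steps ahead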

## [1] 2

## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(wmurders, differences = 2)
## KPSS Level = 0.045793, Truncation lag parameter = 3, p-value = 0.1

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
## 
## Model df: 2.   Total lags used: 10
## Time Series:
## Start = 2005 
## End = 2007 
## Frequency = 1 
## [1] 2.480525 2.374890 2.269256
## Series: wmurders 
## ARIMA(0,2,2) 
## 
## Coefficients:
##           ma1     ma2
##       -1.0181  0.1470
## s.e.   0.1220  0.1156
## 
## sigma^2 = 0.04702:  log likelihood = 6.03
## AIC=-6.06   AICc=-5.57   BIC=-0.15
## [1] 2.480523 2.374887 2.269252

##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -0.0113461 0.2088162 0.1525773 -0.2403396 4.331729 0.9382785
##                     ACF1
## Training set -0.05094066
##                       ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -0.01065956 0.2072523 0.1528734 -0.2149476 4.335214 0.9400996
##                    ACF1
## Training set 0.02176343
##                       ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -0.01336585 0.2016929 0.1531053 -0.3332051 4.387024 0.9415259
##                     ACF1
## Training set -0.03193856

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
## 
## Model df: 2.   Total lags used: 10

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,3)
## Q* = 10.706, df = 7, p-value = 0.152
## 
## Model df: 3.   Total lags used: 10

8.8. Consider the total international visitors to Australia (in millions) for the period 1980-2015. (Data set austa.)
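
A sketch of the fit summarized below:

fit <- auto.arima(austa)   # selects ARIMA(0,1,1) with drift
checkresiduals(fit)
autoplot(forecast(fit, h = 10))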

## Series: austa 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.3006  0.1735
## s.e.  0.1647  0.0390
## 
## sigma^2 = 0.03376:  log likelihood = 10.62
## AIC=-15.24   AICc=-14.46   BIC=-10.57

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
## 
## Model df: 1.   Total lags used: 7

## Time Series:
## Start = 2016 
## End = 2025 
## Frequency = 1 
##      fc_austa_arima.0.1.1$upper.80% fc_austa_arima.0.1.1$upper.95%
## 2016                      0.1138117                     0.09101057
## 2017                      0.2039418                     0.22885273
## 2018                      0.2527212                     0.30345426
## 2019                      0.2886706                     0.35843413
## 2020                      0.3180942                     0.40343373
## 2021                      0.3434784                     0.44225553
## 2022                      0.3660754                     0.47681466
## 2023                      0.3866116                     0.50822207
## 2024                      0.4055495                     0.53718500
## 2025                      0.4232034                     0.56418442
## Time Series:
## Start = 2016 
## End = 2025 
## Frequency = 1 
##      fc_austa_arima.0.1.0$lower.80% fc_austa_arima.0.1.0$lower.95%
## 2016                   -0.199956384                    -0.22275751
## 2017                   -0.109826244                    -0.08491535
## 2018                   -0.061046922                    -0.01031382
## 2019                   -0.025097519                     0.04466605
## 2020                    0.004326137                     0.08966565
## 2021                    0.029710347                     0.12848745
## 2022                    0.052307350                     0.16304658
## 2023                    0.072843552                     0.19445399
## 2024                    0.091781392                     0.22341692
## 2025                    0.109435361                     0.25041634

8.11. Consider the total net generation of electricity (in billion kilowatt hours) by the U.S. electric industry (monthly for the period January 1973 - June 2013). (Data set usmelec.) In general there are two peaks per year: in mid-summer and mid-winter.
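
A sketch of the transformation and differencing checks behind the output below:

lambda <- BoxCox.lambda(usmelec)      # about -0.574
nsdiffs(usmelec)                      # one seasonal difference needed
ndiffs(diff(usmelec, lag = 12))       # one further difference needed

fit <- Arima(usmelec, order = c(0, 1, 2), seasonal = c(0, 1, 1),
             lambda = lambda)
summary(fit)
checkresiduals(fit)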

## [1] 1
## [1] 1

## [1] "AIC for ARIMA(0,1,2)(0,1,1)[12]:"
## [1] -5081.506
## [1] "AIC for ARIMA(0,1,3)(0,1,1)[12]:"
## [1] -5080.441
## Series: usmelec 
## ARIMA(0,1,2)(0,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##           ma1      ma2     sma1
##       -0.4317  -0.2552  -0.8536
## s.e.   0.0439   0.0440   0.0261
## 
## sigma^2 = 1.284e-06:  log likelihood = 2544.75
## AIC=-5081.51   AICc=-5081.42   BIC=-5064.87

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(0,1,1)[12]
## Q* = 32.525, df = 21, p-value = 0.05176
## 
## Model df: 3.   Total lags used: 24
## Series: usmelec 
## ARIMA(1,1,3)(2,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##          ar1      ma1      ma2     ma3    sar1     sar2     sma1
##       0.3952  -0.8194  -0.0468  0.0414  0.0403  -0.0934  -0.8462
## s.e.  0.3717   0.3734   0.1707  0.1079  0.0560   0.0533   0.0343
## 
## sigma^2 = 1.278e-06:  log likelihood = 2547.75
## AIC=-5079.5   AICc=-5079.19   BIC=-5046.23

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)(2,1,1)[12]
## Q* = 25.995, df = 17, p-value = 0.07454
## 
## Model df: 7.   Total lags used: 24
## [1] "usmelec time range:"
## [1] 1973    1 2013    6
## [1] "usmelec.new.ts time range:"
## [1] 1957    1 1988    9
## [1] "Initial forecast period for comparison (based on usmelec):"
## [1] 2013.583 2017.500
## [1] "Adjusted forecast period for comparison:"
## [1] 1984.750 1988.667
## [1] "usmelec.new.ts_next4years:"
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1984                                      94  94  94
## 1985  94  93  93  93  93  93  93  93  93  93  93  93
## 1986  93  93  93  92  92  92  92  92  92  92  92  92
## 1987  92  92  92  92  92  93  93  93  93  93  93  93
## 1988  93  93  94  94  94  94  94  94  94
## [1] "fc_4years:"
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2013                                                       399.5328 397.6997
## 2014 352.0453 311.4639 318.5536 297.6065 326.0700 361.8061 403.0144 400.1553
## 2015 354.0714 313.1343 320.2843 299.1613 327.8655 363.9215 405.5219 402.6349
## 2016 356.1160 314.8189 322.0299 300.7289 329.6767 366.0565 408.0541 405.1387
## 2017 358.1792 316.5178 323.7905 302.3095 331.5036 368.2112 410.6113 407.6671
## 2018 360.2612 318.2311 325.5663 303.9032 333.3465 370.3859 413.1937 410.2204
## 2019 362.3624 319.9591 327.3575 305.5101 335.2057 372.5809 415.8018 412.7990
## 2020 364.4830 321.7018 329.1642 307.1305 337.0811 374.7965 418.4360 415.4032
## 2021 366.6231 323.4596 330.9867 308.7644 338.9732 377.0329 421.0964 418.0334
## 2022 368.7831 325.2326 332.8251 310.4120 340.8820 379.2903 423.7836 420.6898
## 2023 370.9631 327.0208 334.6796 312.0736 342.8078 381.5691 426.4979 423.3729
## 2024 373.1636 328.8246 336.5504 313.7491 344.7508 383.8696 429.2397 426.0831
## 2025 375.3846 330.6442 338.4378 315.4389 346.7111 386.1919 432.0093 428.8206
## 2026 377.6265 332.4796 340.3419 317.1430 348.6891 388.5364 434.8071 431.5859
## 2027 379.8896 334.3311 342.2628 318.8617 350.6848 390.9034 437.6335 434.3794
## 2028 382.1740 336.1988 344.2009 320.5951 352.6987 393.2931                  
##           Sep      Oct      Nov      Dec
## 2013 337.7733 313.7329 305.2066 341.9200
## 2014 339.6715 315.4225 306.8244 343.8551
## 2015 341.5865 317.1265 308.4558 345.8074
## 2016 343.5186 318.8450 310.1008 347.7772
## 2017 345.4679 320.5783 311.7597 349.7648
## 2018 347.4347 322.3264 313.4327 351.7703
## 2019 349.4192 324.0895 315.1197 353.7939
## 2020 351.4215 325.8679 316.8212 355.8360
## 2021 353.4420 327.6617 318.5371 357.8966
## 2022 355.4809 329.4711 320.2677 359.9761
## 2023 357.5383 331.2962 322.0131 362.0746
## 2024 359.6145 333.1374 323.7736 364.1925
## 2025 361.7097 334.9946 325.5492 366.3300
## 2026 363.8242 336.8682 327.3402 368.4872
## 2027 365.9582 338.7584 329.1468 370.6645
## 2028

## [1] "actual_4years:"
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1984                                      94  94  94
## 1985  94  93  93  93  93  93  93  93  93  93  93  93
## 1986  93  93  93  92  92  92  92  92  92  92  92  92
## 1987  92  92  92  92  92  93  93  93  93  93  93  93
## 1988  93  93  94  94  94  94  94  94  94
## [1] "fc_nuclear_4years:"
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1984                                                                        
## 1985 93.71605 92.82689 93.83383 93.35217 92.49527 92.60022 92.23865 93.06886
## 1986 93.33975 92.43706 92.29195 93.27358 92.83494 91.20559 91.65051 92.58837
## 1987 92.38101 92.31492 91.91673 90.30469 91.66672 93.31348 93.10379 92.76173
## 1988 92.86112 92.85607 93.00636 94.60372 94.94623 95.28951 93.67963 92.92924
##           Sep      Oct      Nov      Dec
## 1984          94.57430 94.24037 93.61146
## 1985 93.63216 93.08133 92.52198 92.81689
## 1986 92.42198 91.64386 91.58136 92.18538
## 1987 92.86017 93.15340 93.14861 92.91310
## 1988 93.99541
## [1] "Accuracy of nuclear units model over last 4 years:"
##                  ME      RMSE       MAE        MPE      MAPE      ACF1
## Test set 0.01936645 0.5894308 0.4644369 0.02097268 0.4997495 0.1636175
##          Theil's U
## Test set  2.023133
## For nuclear units, forecasts are considered sufficiently accurate if MAPE < 10% or RMSE is low relative to scale (e.g., < 5 units).
## MAPE from last 4 years: 0.4997495