2.1. Use the help function to explore what the series gold, woolyrnq and gas represent.
The gold, woolyrnq, and gas datasets are available in the forecast
package in R. Do the following:
a. Use autoplot() to plot each of these in separate plots.
b. What is the frequency of each series? (Hint: Apply the frequency()
function.)
c. Use which.max() to spot the outlier in the gold series. Which
observation was it?
Approach:
For this exercise, we need to analyze three time series datasets: gold
(daily gold prices), woolyrnq (quarterly wool yarn production), and gas
(monthly Australian gas production). The goal is to visualize each
series, determine their frequencies, and identify an outlier in the gold
series. My approach is as follows:
Use str() to inspect the structure of each dataset for context.
Use autoplot() from the forecast package to create time series plots.
Apply frequency() to extract the number of observations per time unit.
Use which.max() to find the index of the maximum value in gold, assuming the outlier is the highest value, and retrieve that value.
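A minimal sketch of the commands behind the output below (the fpp2 package loads forecast together with these data sets):
library(fpp2)
str(gold); str(woolyrnq); str(gas)   # structure of each series
autoplot(gold)       # daily gold prices
autoplot(woolyrnq)   # quarterly wool yarn production
autoplot(gas)        # monthly Australian gas production
frequency(gold)      # observations per "year" as stored in each ts object
frequency(woolyrnq)
frequency(gas)
which.max(gold)        # index of the maximum (outlier) observation
gold[which.max(gold)]  # its value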
## Time-Series [1:1108] from 1 to 1108: 306 300 303 297 304 ...
## Time-Series [1:119] from 1965 to 1994: 6172 6709 6633 6660 6786 ...
## Time-Series [1:476] from 1956 to 1996: 1709 1646 1794 1878 2173 ...
## [1] "Frequency"
## [1] "gold"
## [1] 1
## [1] "woolyrnq"
## [1] 4
## [1] "gas"
## [1] 12
## [1] "When gold got maximum value?"
## [1] 770
## [1] "What was the gold's maximum value?"
## [1] 593.7
Interpretation:
Part a (Plots): The autoplot() function generates time series plots for each dataset. The gold plot likely shows daily fluctuations with a prominent spike (the outlier we’ll identify in part c). The woolyrnq plot displays quarterly data, possibly with seasonal peaks every year. The gas plot shows monthly production with a long-term upward trend and seasonal cycles, typical of energy data. These visualizations help us understand the data’s behavior over time.
Part b (Frequency): The frequency() function reveals the stored time intervals: gold has a frequency of 1 (the daily series is stored without a seasonal period), woolyrnq has a frequency of 4 (quarterly), and gas has a frequency of 12 (monthly). This confirms each dataset's temporal granularity, which matters for later time series analysis.
Part c (Outlier): Using which.max(), we find the maximum value in the gold series occurs at observation 770, with a value of 593.7. This spike is likely an outlier, since daily gold prices do not normally jump by that much from one observation to the next. Checking the plot, this point sits well above the neighbouring observations, suggesting a possible data error or an extreme one-day event.
2.3. Download some monthly Australian retail data from the
book website. These represent retail sales in various categories for
different Australian states, and are stored in a MS-Excel
file.
a. Read the data into R using the provided script.
b. Select one of the time series by extracting a column and converting
it to a time series object.
c. Explore your chosen retail time series using the functions
autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), and ggAcf().
Can you spot any seasonality, cyclicity, and trend? What do you learn
about the series?
Approach:
This exercise involves loading and analyzing a monthly retail sales
dataset from an Excel file. We’ll read the data, select a specific time
series (one column representing sales in a category and state), and use
visualization tools from the forecast package to examine its patterns.
My approach is:
Load the Excel file, skipping the first row since it contains an extra header.
Extract a column (e.g., “A3349873A”) and convert it to a time series object with monthly frequency starting April 1982.
Use five plotting functions to investigate trend (long-term direction), seasonality (recurring yearly patterns), cyclicity (longer-term waves), and autocorrelation (relationship between lagged observations).
Interpret the plots to understand the series’ behavior, which is foundational for forecasting.
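A minimal sketch of this workflow, using readxl::read_excel() (the interpretation below mentions xlsx::read.xlsx(), which can be used instead); the column "A3349873A" and the April 1982 start follow the steps listed above:
library(fpp2)
library(readxl)
retaildata <- read_excel("retail.xlsx", skip = 1)   # first row is an extra header
myts <- ts(retaildata[["A3349873A"]], frequency = 12, start = c(1982, 4))
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts, lags = 12)
ggAcf(myts)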
Plots (Descriptions):
autoplot(myts): A line plot showing retail turnover over time, likely with an upward trend and seasonal spikes.
ggseasonplot(myts): A plot with 12 lines (one per month) across years, showing higher sales in December (e.g., Christmas).
ggsubseriesplot(myts): 12 subplots (one per month), each with a mini time series, highlighting December peaks and a general upward trend.
gglagplot(myts, lags = 12): Scatterplots for lags 1–12, with a strong linear pattern at lag 12 (yearly seasonality).
ggAcf(myts): Bar chart of autocorrelation, with significant spikes at lags 12, 24, etc., and a gradual decline (trend effect).
Interpretation:
Part a (Data Loading):
The xlsx::read.xlsx() function successfully loads the “retail.xlsx” file, skipping the first row to align column names correctly. Viewing the data confirms it’s a table of monthly retail turnover across categories and states, with column “A3349873A” representing, for example, clothing sales in a specific state (you’d specify your chosen column).
Part b (Time Series Creation):
Converting column “A3349873A” to a ts object with frequency=12 and start=c(1982,4) defines it as monthly data starting April 1982. This prepares it for time series analysis.
Part c (Exploration):
autoplot: The plot likely shows a long-term upward trend (increasing retail sales over decades) with regular yearly peaks, suggesting both trend and seasonality. No clear cyclicity (multi-year waves) is apparent beyond seasonal effects.
ggseasonplot: Strong seasonality is evident, with December consistently higher (e.g., holiday shopping) and perhaps a dip in February (short month, post-holiday slump).
ggsubseriesplot: Each month’s subseries trends upward, with December showing the largest jumps, reinforcing seasonal and trend components.
gglagplot: A tight linear pattern at lag 12 confirms strong yearly seasonality—sales 12 months apart are highly correlated. Other lags show weaker patterns, indicating seasonality dominates over cyclicity.
ggAcf: Significant positive autocorrelation at lags 12, 24, etc., reflects yearly seasonality, while a slow decay from lag 1 suggests a trend. No negative spikes imply no strong mean-reverting cycles.
What I Learned: The series exhibits a clear upward trend (retail growth over time), strong seasonality (yearly cycles, especially holiday-driven), and no obvious cyclicity beyond seasonal effects. The autocorrelation confirms these patterns persist consistently, making this series a good candidate for seasonal forecasting models (e.g., ARIMA with seasonal components) later in the course.
6.2. The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
a. Plot the time series of sales of product A. Can you identify
seasonal fluctuations and/or a trend-cycle?
b. Use a classical multiplicative decomposition to calculate the
trend-cycle and seasonal indices.
c. Do the results support the graphical interpretation from part
(a)?
d. Compute and plot the seasonally adjusted data.
e. Change one observation to be an outlier (e.g., add 500 to one
observation), and recompute the seasonally adjusted data. What is the
effect of the outlier?
f. Does it make any difference if the outlier is near the end rather
than in the middle of the time series?
Approach:
This exercise involves analyzing the plastics time series to identify
its components (trend, seasonality, and irregular fluctuations) using
classical decomposition. We’ll visualize the data, decompose it
multiplicatively (since sales often grow proportionally), assess
seasonally adjusted data, and explore the impact of outliers at
different positions. My approach is:
Plot the raw data to visually inspect trend and seasonality.
Apply multiplicative decomposition to separate trend-cycle, seasonal indices, and irregular components.
Compare decomposition results to the initial plot.
Compute and plot seasonally adjusted data to isolate trend and irregular effects.
Introduce an outlier in the middle, then near the end, and assess its impact on decomposition outputs.
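A minimal sketch of the decomposition workflow described above (plastics ships with fpp2; the outlier positions 20 and 55 follow the discussion below):
library(fpp2)
autoplot(plastics)                                             # part (a)
decompose_plastics <- decompose(plastics, type = "multiplicative")
autoplot(decompose_plastics)                                   # part (b)
# part (d): seasonally adjusted data layered over the raw series and trend
autoplot(plastics, series = "Data") +
  autolayer(trendcycle(decompose_plastics), series = "Trend") +
  autolayer(seasadj(decompose_plastics), series = "Seasonally Adjusted")
# parts (e)/(f): add an outlier and recompute the seasonally adjusted series
plastics_out <- plastics
plastics_out[20] <- plastics_out[20] + 500                     # repeat with index 55 for part (f)
autoplot(seasadj(decompose(plastics_out, type = "multiplicative")))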
Part a - autoplot(plastics):
Part b - autoplot(decompose_plastics):
Trend: Smooth upward curve from ~900 to ~1500.
Seasonal: 12 monthly indices (e.g., July ~1.2, February ~0.8).
Random: Small fluctuations around 1, mostly noise.
Part d - Layered Plot:
Gray (Data): Raw sales with peaks.
Red (Trend): Smooth upward line.
Blue (Seasonally Adjusted): Trend-like but with minor wiggles (no seasonality).
Part e - Outlier at 20:
Data: Spike at observation 20 (e.g., ~1400 becomes ~1900).
Trend: Slightly elevated near 20 but smooths out.
Seasonally Adjusted: Large spike at 20, less smooth elsewhere.
Part f - Outlier at 55:
Data: Spikes at 20 and 55 (e.g., ~1450 becomes ~1950).
Trend: Minimal change near 55 (end effect).
Seasonally Adjusted: Spikes at 20 and 55, otherwise follows trend.
Interpretation:
Part a (Initial Plot): The autoplot(plastics) shows a clear upward trend—sales increase over the 5 years from about 900 to 1500 thousand units. Seasonal fluctuations are evident, with peaks every 12 months (likely summer demand for plastics), suggesting a strong yearly cycle. No obvious cyclicity (multi-year waves) beyond the trend and seasonality is visible.
Part b (Decomposition): The multiplicative decomposition (type = “multiplicative”) splits the series into trend-cycle (smooth growth), seasonal indices (consistent yearly pattern, e.g., high in July, low in February), and irregular components (minor noise). The plot confirms these components align with the raw data’s behavior.
Part c (Support for Part a): Yes, the decomposition supports part a’s interpretation. The trend-cycle matches the upward movement, and the seasonal component quantifies the yearly fluctuations seen in the plot (e.g., indices > 1 for peak months, < 1 for troughs).
Part d (Seasonally Adjusted): The seasonally adjusted data (seasadj()) removes the seasonal indices, leaving trend plus irregular components. The plot shows it closely follows the trend line but retains small random fluctuations, confirming seasonality was successfully stripped out.
Part e (Outlier Effect): Adding 500 to observation 20 (e.g., mid-year 2) creates a spike in the raw data. The trend-cycle adjusts slightly upward near this point but remains smooth due to averaging. The seasonally adjusted data, however, shows a dramatic spike at 20 because the irregular component absorbs the outlier’s full effect after seasonality is removed. This disrupts the smoothness, indicating sensitivity to outliers.
Part f (Outlier Position): Adding 500 to observation 55 (near the end, e.g., year 5) adds a second spike. The trend-cycle is less affected here because the centred moving average used in classical decomposition cannot be computed for the last six observations, so an end outlier influences fewer trend estimates. The seasonally adjusted data still spikes at 55, but the trend's stability suggests end outliers have a smaller impact on the estimated trend-cycle than middle outliers.
Insight: Multiplicative decomposition assumes seasonality grows with the trend, which fits this data (larger peaks in later years). Outliers distort the irregular component most, affecting seasonally adjusted data more than the trend, especially in the middle where the moving average has more data to smooth over. This highlights a limitation of classical methods—robust alternatives (e.g., STL decomposition) might handle outliers better.
Assignment 3 includes problems 3.1 and 3.2 from the KJ text.
The following R packages have been used for completion of all homework assignments to date:
#Textbook Packages
library(fpp2)
#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
#install.packages("mlbench")
library(mlbench)
#Processing
library(tidyverse)
#Graphing
library(ggplot2)
library(gridExtra)
#Math
library(caret)
library(randomForest)
library(seasonal)
library(psych)
library(corrplot)
#Formatting
library(knitr)
library(kableExtra)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Approach:
This exercise involves exploratory data analysis (EDA) of the Glass
dataset’s predictors to understand their distributions, relationships,
outliers, skewness, and potential transformations for classification
modeling. My approach is:
Load and inspect the data structure.
For (a): Use summary statistics, histograms, and correlation plots to examine distributions and relationships.
For (b): Use boxplots to detect outliers and compute skewness to assess asymmetry.
For (c): Apply a Box-Cox transformation to reduce skewness, then re-evaluate distributions and skewness to assess improvement.
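A minimal sketch of the EDA for parts (a) and (b), using packages already loaded above (mlbench, psych, corrplot); the object name glass_pred is illustrative:
library(mlbench)
library(psych)
library(corrplot)
data(Glass)
glass_pred <- Glass[, -10]                # drop the Type response
Glass_desc <- describe(glass_pred)        # summary statistics, including skew
Glass_desc
par(mfrow = c(3, 3))
for (v in names(glass_pred)) hist(glass_pred[[v]], main = v, xlab = v)   # distributions
for (v in names(glass_pred)) boxplot(glass_pred[[v]], main = v)          # outliers
par(mfrow = c(1, 1))
corrplot(cor(glass_pred))                 # pairwise correlations
Glass_desc[order(Glass_desc$skew), "skew", drop = FALSE]                 # skewness, sorted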
We can quickly evaluate the predictor variables using summary statistics.
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RI | 1 | 214 | 1.5183654 | 0.0030369 | 1.51768 | 1.5180119 | 0.0018755 | 1.51115 | 1.53393 | 0.02278 | 1.6027151 | 4.7167266 | 0.0002076 |
| Na | 2 | 214 | 13.4078505 | 0.8166036 | 13.30000 | 13.3768023 | 0.6449310 | 10.73000 | 17.38000 | 6.65000 | 0.4478343 | 2.8979666 | 0.0558219 |
| Mg | 3 | 214 | 2.6845327 | 1.4424078 | 3.48000 | 2.8655233 | 0.3039330 | 0.00000 | 4.49000 | 4.49000 | -1.1364523 | -0.4526762 | 0.0986010 |
| Al | 4 | 214 | 1.4449065 | 0.4992696 | 1.36000 | 1.4122093 | 0.3113460 | 0.29000 | 3.50000 | 3.21000 | 0.8946104 | 1.9383534 | 0.0341294 |
| Si | 5 | 214 | 72.6509346 | 0.7745458 | 72.79000 | 72.7073256 | 0.5708010 | 69.81000 | 75.41000 | 5.60000 | -0.7202392 | 2.8163627 | 0.0529469 |
| K | 6 | 214 | 0.4970561 | 0.6521918 | 0.55500 | 0.4318023 | 0.1704990 | 0.00000 | 6.21000 | 6.21000 | 6.4600889 | 52.8665268 | 0.0445829 |
| Ca | 7 | 214 | 8.9569626 | 1.4231535 | 8.60000 | 8.7421512 | 0.6597570 | 5.43000 | 16.19000 | 10.76000 | 2.0184463 | 6.4104000 | 0.0972848 |
| Ba | 8 | 214 | 0.1750467 | 0.4972193 | 0.00000 | 0.0337791 | 0.0000000 | 0.00000 | 3.15000 | 3.15000 | 3.3686800 | 12.0801412 | 0.0339892 |
| Fe | 9 | 214 | 0.0570093 | 0.0974387 | 0.00000 | 0.0358140 | 0.0000000 | 0.00000 | 0.51000 | 0.51000 | 1.7298107 | 2.5203615 | 0.0066608 |
The histograms below help us understand the distribution of each of our predictor variables. We can visualize the skewness from the summary statistics in these plots. Mg and Si are both skewed to the left, whereas the other variables have means greater than their medians, which skews their distributions to the right.
The plots for RI, Na, Al, and Si are relatively symmetric, suggesting normal distributions. The remainder show asymmetric distributions with either unimodal or bimodal peaks.
We can evaluate the relationship between our predictor variables through examination of their correlation with one another. The plot below indicates the strongest positive relationship between Ca and RI and strongest negative relationship between Si and RI.
We can use boxplots to indicate whether or not our data contain any outliers. All predictors except Mg show values beyond 1.5 times the interquartile range from the quartiles. The plot for Ba shows the largest number of outliers compared to the other predictor variables.
The small sample size (only 214 observations per variable) may exaggerate the skewness observed in the data. As noted earlier, Mg and Si are skewed left, whereas all other variables are skewed right. The degree of skewness is shown in the table below:
| Variable | Skew |
|---|---|
| Mg | -1.1364523 |
| Si | -0.7202392 |
| Na | 0.4478343 |
| Al | 0.8946104 |
| RI | 1.6027151 |
| Fe | 1.7298107 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| K | 6.4600889 |
We can apply a Box-Cox transformation to our entire dataset to reduce
skewness, using the forecast package. The Box-Cox method
uses maximum likelihood estimation to determine the transformation
parameter \(\lambda\). This can be used
to improve the overall classification model. There are additional
methods that could also be applied to reduce the influence of outliers (e.g., the spatial sign transformation)
if we find that our predictive models are not resistant to their effect.
However, given our small sample size, we need more information regarding
our sample to understand the outlying data points.
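A minimal sketch of one way to apply the transformation with the forecast package, estimating a separate \(\lambda\) per predictor; the exact implementation behind the table below is an assumption, and predictors containing zeros (e.g., Mg) may be left nearly unchanged:
library(mlbench)
library(forecast)
library(psych)
data(Glass)
glass_pred <- Glass[, -10]                # predictors only
glass_bc <- as.data.frame(
  lapply(glass_pred, function(x) BoxCox(x, BoxCox.lambda(x)))
)
data.frame(
  Variable      = names(glass_pred),
  `Pre-T Skew`  = round(describe(glass_pred)$skew, 4),
  `Post-T Skew` = round(describe(glass_bc)$skew, 4),
  check.names   = FALSE
)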
The table below shows the degree to which the skewness of each variable changed as a result of applying the Box-Cox transformation.
| Variable | Pre-T Skew | Post-T Skew | Change |
|---|---|---|---|
| RI | 1.6027151 | 1.5903260 | 0.0124 |
| Na | 0.4478343 | 0.0338633 | 0.4140 |
| Mg | -1.1364523 | -1.1364563 | 0.0000 |
| Al | 0.8946104 | -0.1505858 | 1.0452 |
| Si | -0.7202392 | -0.7888337 | 0.0686 |
| K | 6.4600889 | -1.9759297 | 8.4360 |
| Ca | 2.0184463 | 1.0515034 | 0.9669 |
| Ba | 3.3686800 | 1.7123349 | 1.6563 |
| Fe | 1.7298107 | 0.7593730 | 0.9704 |
Part a - Summary Statistics (Glass_desc):
Part a - Histograms:
RI: Near-normal, slight right skew.
Mg: Left-skewed (many high values, few low).
Ba: Right-skewed, spike at 0, long tail.
Part a - Correlation Plot:
Strongest positive: RI vs. Ca (~0.81).
Strongest negative: RI vs. Si (~-0.54).
Part b - Boxplots:
Ba: Many outliers above 1.5 (e.g., 3.15).
Mg: No outliers, tight range.
Part b - Skewness Table:
Part c - Transformed Histograms:
Ba: Less skewed, peak shifts left.
Mg: More symmetric.
Part c - Skewness Comparison:
Ba: Pre = 3.37, Post = 1.71, Change = 1.66.
Mg: Pre = -1.14, Post = -1.14, Change = 0.00 (essentially untransformed).
Interpretation:
Part a (Distributions and Relationships):
Summary Statistics: The describe() function reveals variability (e.g., Ba has a high sd relative to its mean, indicating spread). RI and Si have tight ranges, while Ba and Fe are sparse (many zeros).
Histograms: RI, Na, Al, and Si appear roughly symmetric, suggesting near-normal distributions, though RI has a slight right tail. Mg is left-skewed (high values dominate), while K, Ca, Ba, and Fe are right-skewed, with Ba showing a bimodal pattern (peak at 0, smaller peak > 1). This informs potential transformations in part (c).
Correlation: The corrplot highlights RI vs. Ca (positive, ~0.81) due to chemical similarity in glass composition, and RI vs. Si (negative, ~-0.54) as silicon reduces refractive index. Most other correlations are weak, suggesting limited multicollinearity.
Part b (Outliers and Skewness):
Outliers: Boxplots show outliers in all predictors except Mg. Ba has the most (e.g., values > 1.5), reflecting rare high concentrations. RI has few, given its tight range. Outliers may affect classification if not robustly handled.
Skewness: Mg (-1.14) and Si (-0.72) are left-skewed; the others are right-skewed (e.g., Ba 3.37, K 6.46). The small sample size (214) can exaggerate skewness, but these asymmetries could bias models like logistic regression unless addressed.
Part c (Transformations):
Box-Cox: The transformation reduces skewness by finding an optimal λ for each predictor (e.g., λ < 1 for right-skewed variables, λ > 1 for left-skewed ones). Post-transformation histograms show Ba, Ca, and Fe becoming more symmetric, while Mg is left essentially unchanged.
Skewness Reduction: Ba drops from 3.37 to 1.71 (change = 1.66), Ca from 2.02 to 1.05, and K from 6.46 to -1.98 (over-corrected into left skew). This normalization can improve classification models (e.g., LDA, SVM) that assume normality or are sensitive to scale.
Insight: Box-Cox is effective here, but outliers remain (transformations don’t remove them). For a small dataset, robust methods (e.g., spatial sign) or outlier removal might further enhance model performance, depending on their source (error vs. true variation).
Insight: The predictors’ chemical nature drives their distributions (e.g., Ba rarity causes skewness). Transformations align the data with statistical assumptions, but the small sample and class imbalance (7 categories) suggest caution—cross-validation will be key to confirm model improvement.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
Approach:
This exercise involves exploratory data analysis (EDA) of the Soybean
dataset’s categorical predictors and missing data patterns to inform
modeling strategies. My approach is:
For (a): Use bar plots to examine frequency distributions and identify degenerate predictors (e.g., near-uniform or highly imbalanced).
For (b): Quantify missing data per predictor and class, then analyze patterns linking missingness to specific classes.
For (c): Propose a missing data strategy based on the observed patterns, balancing imputation and model choice while considering the dataset’s structure.
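A minimal sketch of the EDA steps above (Soybean ships with mlbench; tidyverse is already loaded):
library(mlbench)
library(tidyverse)
data(Soybean)
# (a) frequency distribution of each categorical predictor
Soybean %>%
  select(-Class) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "level") %>%
  ggplot(aes(level)) +
  geom_bar() +
  facet_wrap(~ predictor, scales = "free_x")
# (b) missing values per predictor, and proportion missing within each class
sort(colSums(is.na(Soybean)), decreasing = TRUE)
Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x))))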
The bar plots below help us identify categorical predictors with degenerate distributions. A number of variables have what the text describes as a "handful of unique values that occur with very low frequencies." These occurrences can be problematic for certain types of models (e.g., linear regression), as they affect the calculations within the model. In some instances, such as ext.decay, we could convert the 3 factor levels to a binary indicator to offset the observed imbalance. Other predictors, like sclerotia and mycelium, could be dropped entirely, as these factors are overwhelmingly dominated by a single level, with only a few other or missing observations.
Because missing data is limited to only five classes, it is important for us to understand why the data is missing. If the missing observations are due to structural reasons, the NA values should be retained. For example, these specific classes might not bear fruit, which would account for NA observations for fruit bodies, pods, and seeds. Imputing these observations, especially where 100% of the data is unaccounted for, could severely bias our model without understanding the rationale behind this clear pattern of NA values. Due to the size of our data, elimination of predictors would not be ideal. Each variable has less than 18% of its data missing, which could be handled by a KNN or mode imputation approach. However, I would only recommend that type of imputation for the date and area.dam variables, as they are only missing up to 6.25% in individual classes.
The other variables contain much higher proportions of missing data per class, ranging from 40 to 100 percent. For this reason, I would recommend using a tree-based modeling approach, as these models are able to handle missing data, unlike many other types of regression models.
Interpretation:
Part a (Frequency Distributions): As discussed above, several predictors are degenerate: mycelium and sclerotia are dominated by a single level, and others (e.g., ext.decay) are highly imbalanced across their levels.
Part b (Missing Data):
Predictors: Missing values are spread across most predictors, with the worst-affected columns missing up to roughly 18% of their values. Class and leaves have no NAs, and date and area.dam have almost none. Predictors describing plant condition, such as fruiting.bodies, seed.size, and fruit.pods, are among the most affected, while environmental variables (e.g., temp) have fewer NAs.
Class Patterns: Missingness is confined to 5 of the 19 classes (e.g., phytophthora-rot, 2-4-d-injury), which together account for roughly 18% of the samples. Within these classes, some predictors are 100% NA (for example, most of the plant-condition variables in 2-4-d-injury), suggesting structural missingness (disease-specific traits absent or simply not recorded). This non-random pattern implies missingness is tied to class biology, not random error.
Part c (Strategy):
Analysis: Eliminating predictors risks losing valuable signal, as even high-NA variables (e.g., fruiting.bodies) vary across classes. Imputation (e.g., KNN, mode) suits low-NA predictors like date (up to 6% per class), but for predictors with 40–100% NA in affected classes (e.g., fruit.pods), imputation could introduce severe bias—imputing zeros for absent fruit in non-fruiting diseases misrepresents reality.
Recommendation: Retain NAs where structurally meaningful (e.g., no fruit in certain diseases) and use tree-based models such as CART (rpart), which handle missing data natively via surrogate splits (a random forest would first require imputation, e.g., rfImpute). This avoids imputation bias and leverages the dataset's full structure. For small-NA predictors, mode imputation could supplement if needed, but the tree approach is robust given the 683-sample size and categorical nature.
Insight: The degeneracy and missingness reflect soybean biology—rare conditions (e.g., mycelium) and disease-specific absences (e.g., fruit.pods) are informative, not noise. Tree models align with this complexity, preserving interpretability over forced imputation. Cross-validation will confirm if retaining NAs outperforms elimination or imputation.
Consider the pigs series - the number of pigs slaughtered in Victoria each month.
Use the ses function in R to find the optimal values of alpha and l0, and generate forecasts for the next four months.
Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.
Approach:
This exercise involves applying Simple Exponential Smoothing (SES) to
the pigs time series to forecast future values and assess uncertainty.
My approach is:
Inspect the pigs data structure and initial values.
Fit an SES model, optimizing \(\alpha\) and \(l_0\), and forecast 4 months ahead.
Extract model details and prediction intervals from R’s output.
Manually compute a 95% prediction interval using the residual standard deviation and compare it to R’s interval.
Visualize the data, fitted values, and forecasts to interpret the model’s performance.
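A minimal sketch of the SES workflow above (pigs ships with fpp2):
library(fpp2)
ses_pigs <- ses(pigs, h = 4)         # alpha and l0 estimated by the function
summary(ses_pigs)
# manual 95% interval for the first forecast, using the residual standard deviation
s   <- sd(residuals(ses_pigs))
fc1 <- ses_pigs$mean[1]
c(lower = fc1 - 1.96 * s, upper = fc1 + 1.96 * s)
# R's own 95% interval, for comparison
c(lower = ses_pigs$lower[1, "95%"], upper = ses_pigs$upper[1, "95%"])
autoplot(ses_pigs) + autolayer(fitted(ses_pigs), series = "Fitted")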
## Time-Series [1:188] from 1980 to 1996: 76378 71947 33873 96428 105084 ...
## Jan Feb Mar Apr May Jun
## 1980 76378 71947 33873 96428 105084 95741
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
## 95%
## 119020.8
## 95%
## 78611.97
## [1] 118952.8
## [1] 78679.97
Interpretation:
Data Overview: The pigs series tracks the monthly number of pigs slaughtered in Victoria over roughly 16 years (1980–1995). str() confirms it's a time series with frequency 12, and head() shows typical values of roughly 70,000–105,000 early on, suggesting possible trends or seasonality to explore.
SES Model: The ses() function optimizes \(\alpha = 0.2971\), indicating moderate smoothing (closer to 0 than 1, so past data retains influence), and \(l_0 = 77260.06\), the estimated level at time 0. It forecasts 4 months ahead (about 98,816 for each month), assuming no trend or seasonality (an SES limitation).
Plot: The autoplot() shows historical data with possible upward trend and fluctuations, overlaid with fitted values (a smoothed curve reflecting SES’s level-only fit) and flat forecasts. The 95% PI fans out slightly, reflecting increasing uncertainty, though SES assumes constant variance, limiting its realism if trends exist.
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter alpha) and level (the initial level l0). It should return the forecast of the next observation in the series. Does it give the same forecast as ses?
Approach:
I need to create a custom SES function that mimics the mechanics of R’s
ses() function, then test it against ses() using two datasets. Here’s
how I’ll tackle it:
For the function, I’ll use the SES formula.
I'll extract alpha and l0 from ses() fits on pigs (from 7.2 above) and ausbeer (quarterly beer production), run my function, and compare the forecasts.
I’ll interpret whether my function matches ses() and reflect on any differences.
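A minimal sketch of the custom function described above; the name my_ses is illustrative:
library(fpp2)
my_ses <- function(y, alpha, level) {
  for (t in seq_along(y)) {
    level <- alpha * y[t] + (1 - alpha) * level   # SES level recursion
  }
  level                                           # forecast of the next observation
}
ses_pigs <- ses(pigs, h = 4)
my_ses(pigs,
       alpha = ses_pigs$model$par["alpha"],
       level = ses_pigs$model$par["l"])           # should equal ses_pigs$mean[1]
# repeat with ausbeer to mirror the second comparison below
ses_beer <- ses(ausbeer, h = 4)
my_ses(ausbeer, ses_beer$model$par["alpha"], ses_beer$model$par["l"])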
## Forecast of next observation by SES function:
## 98816.4061115907
## alpha
## 98816.41
## Forecast of next observation by ses function: 98816.4061115907
## Forecast of next observation by SES function:
## 421.313649201742
## alpha
## 421.3136
## Forecast of next observation by ses function: 421.313649201742
Discussion of Results:
Function Design: I wrote my SES function to take a time series y, smoothing parameter alpha, and initial level l0. It starts the level at \(l_0\) and iteratively updates it using \(l_t = \alpha y_t + (1 - \alpha) l_{t-1}\). After processing all observations, the final smoothed level becomes the forecast for the next period, which I print using cat(). This matches SES's core logic: the forecast is the final smoothed level.
Pigs Test: For pigs, I used \(\alpha = 0.2971\) and \(l_0 = 77260.06\) from ses_pigs (fitted in 7.2 above). My function output 98816.41, exactly matching ses_pigs$mean[1]. This confirms my implementation mirrors R's SES for this dataset, which has minimal trend or seasonality to exploit (SES assumes a flat level).
Ausbeer Test: For ausbeer, I fitted ses() and got \(\alpha \approx 0.1489\) and \(l_0 \approx 259.66\). My function returned 421.3136, again identical to ses_ausbeer$mean[1]. This consistency holds despite ausbeer's known seasonality, as SES ignores it and focuses on the level.
Comparison Insight: My function gives the same forecast as ses() in both cases. Why? Because I replicated SES's recursive formula exactly, using the same \(\alpha\) and \(l_0\) optimized by ses(). The ses() function minimizes squared errors to find these parameters, and my loop applies the same smoothing logic, so both outputs align to the full printed precision.
Reflection: I found that my SES works just like ses() for point forecasts. However, ses() also provides prediction intervals and fitted values, which mine doesn’t (yet). If I wanted to extend it, I’d need to store all levels and calculate residual variance—something to consider for future tweaks. For now, I’m confident my function nails the forecast part.
I learned SES is straightforward to code manually, but its simplicity limits it. For pigs or ausbeer, a visual check (e.g., autoplot(pigs)) might reveal trends or seasonality SES misses. My function’s success here validates the math, but real-world forecasting might need Holt’s or Holt-Winters methods instead.
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim function to find the optimal values of alpha and l0. Do you get the same values as the ses function?
Approach:
I need to tweak my SES function to compute SSE instead of a forecast,
then use optim() to minimize SSE and find the best \(\alpha\) and \(l_0\). Here's
how I’ll do it:
I’ll modify SES to take a parameter vector (pars = c(alpha, l0)) and the time series y, calculate errors as I smooth the series, and return SSE.
I'll use optim() with initial guesses (\(\alpha = 0.5\), \(l_0 = y[1]\)) to minimize SSE for pigs and ausbeer.
I’ll compare my optimized parameters to those from ses_pigs and ses_ausbeer (from prior exercises) and figure out why differences might occur.
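A minimal sketch of the SSE version and the optim() call described above; the function name SES_SSE and the starting values are illustrative:
SES_SSE <- function(pars, y) {
  alpha <- pars[1]
  level <- pars[2]
  sse   <- 0
  for (t in seq_along(y)) {
    error <- y[t] - level                         # one-step-ahead forecast error
    sse   <- sse + error^2
    level <- alpha * y[t] + (1 - alpha) * level   # update the level
  }
  sse
}
opt_pigs <- optim(par = c(0.5, pigs[1]), fn = SES_SSE, y = pigs)
opt_pigs$par   # compare with ses(pigs)$model$par[c("alpha", "l")]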
## Optimal parameters for the result of SES function:
## 0.299008094014243, 76379.2653476235
## Parameters got from the result of ses function:
## 0.297148833372095, 77260.0561458528
## Optimal parameters for the result of SES function:
## 0.148803623284766, 259.658459712619
## Parameters got from the result of ses function:
## 0.148900381680056, 259.65918938777
## Time-Series [1:30, 1:2] from 1 to 30: 199 172 111 209 161 119 195 195 131 183 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "Paperback" "Hardcover"
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## Paperback Hardcover
## 1 199 139
## 2 172 128
## 3 111 172
## 4 209 139
## 5 161 191
## 6 119 168
## [1] 33.63769
## [1] 31.93101
## [1] 31.13692
## [1] 27.19358
## 95% PI of paperback sales calculated by holt function
## 95%
## 275.0205
## 95%
## 143.913
## 95% PI of paperback sales calculated by formula
## [1] 270.4951
## [1] 148.4384
## 95% PI of hardcover sales calculated by holt function
## 95%
## 307.4256
## 95%
## 192.9222
## 95% PI of hardcover sales calculated by formula
## [1] 303.4733
## [1] 196.8745
## Time-Series [1:113] from 1977 to 2005: 330 371 271 344 358 ...
## Qtr1 Qtr2 Qtr3 Qtr4
## 1977 330.371 371.051 270.670 343.880
## 1978 358.491 362.822
## ETS(A,N,A)
##
## Call:
## ets(y = ukcars)
##
## Smoothing parameters:
## alpha = 0.6199
## gamma = 1e-04
##
## Initial states:
## l = 314.2568
## s = -1.7579 -44.9601 21.1956 25.5223
##
## sigma: 25.9302
##
## AIC AICc BIC
## 1277.752 1278.819 1296.844
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
## ACF1
## Training set 0.02573334
## [1] "Accuracy of STL + ETS(A, Ad, N) model"
## ME RMSE MAE MPE MAPE MASE
## Training set 1.040088 22.81499 17.73846 -0.1296296 5.821059 0.5780878
## ACF1
## Training set 0.01590553
## [1] "Accuracy of STL + ETS(A, A, N) model"
## ME RMSE MAE MPE MAPE MASE
## Training set -0.4024211 22.86408 17.732 -0.5948215 5.836029 0.5778774
## ACF1
## Training set 0.02416541
## [1] "Accuracy of ETS(A, N, A) model"
## ME RMSE MAE MPE MAPE MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
## ACF1
## Training set 0.02573334
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,Ad,N)
## Q* = 23.825, df = 8, p-value = 0.002451
##
## Model df: 0. Total lags used: 8
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
a. Explain the differences among these figures. Do they all indicate that the data are white noise?
All three plots indicate white noise, but with different visual characteristics:
Small sample (n = 36): larger random fluctuations in the autocorrelations.
Medium sample (n = 360): moderate random fluctuations.
Large sample (n = 1000): smaller random fluctuations, closer to zero.
Critical values: The boundaries are at ±1.96/√n, so they’re wider for smaller samples (36) and narrower for larger samples (1000).
Different autocorrelations: Due to sampling variability - smaller samples have higher variance in their autocorrelation estimates, while larger samples produce more precise estimates closer to zero.
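A quick check of the ±1.96/√n bounds mentioned above:
n <- c(36, 360, 1000)
round(1.96 / sqrt(n), 3)   # approximately 0.327, 0.103, 0.062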
A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
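A minimal sketch of the plots requested above (ibmclose is loaded with fpp2 via the fma package):
library(fpp2)
autoplot(ibmclose)   # the level wanders with no fixed mean: non-stationary
ggAcf(ibmclose)      # ACF starts near 1 and decays very slowly: differencing needed
ggPacf(ibmclose)     # PACF has a single dominant spike at lag 1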
Consider the number of women murdered each year (per 100,000 standard population) in the United States. (Data set wmurders).
a. By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model.
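A minimal sketch of the identification steps behind the output below (wmurders ships with fpp2; kpss.test() is from the tseries package, matching the test output shown):
library(fpp2)
library(tseries)
autoplot(wmurders)                             # trending, non-seasonal annual series
ndiffs(wmurders)                               # suggests d = 2
kpss.test(diff(wmurders, differences = 2))     # twice-differenced series looks stationary
ggAcf(diff(wmurders, differences = 2))         # ACF/PACF point to an MA(2) term
ggPacf(diff(wmurders, differences = 2))
fit <- Arima(wmurders, order = c(0, 2, 2))
checkresiduals(fit)                            # Ljung-Box test on the residuals
forecast(fit, h = 3)                           # forecasts for 2005-2007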
## [1] 2
##
## KPSS Test for Level Stationarity
##
## data: diff(wmurders, differences = 2)
## KPSS Level = 0.045793, Truncation lag parameter = 3, p-value = 0.1
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
##
## Model df: 2. Total lags used: 10
## Time Series:
## Start = 2005
## End = 2007
## Frequency = 1
## [1] 2.480525 2.374890 2.269256
## Series: wmurders
## ARIMA(0,2,2)
##
## Coefficients:
## ma1 ma2
## -1.0181 0.1470
## s.e. 0.1220 0.1156
##
## sigma^2 = 0.04702: log likelihood = 6.03
## AIC=-6.06 AICc=-5.57 BIC=-0.15
## [1] 2.480523 2.374887 2.269252
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0113461 0.2088162 0.1525773 -0.2403396 4.331729 0.9382785
## ACF1
## Training set -0.05094066
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01065956 0.2072523 0.1528734 -0.2149476 4.335214 0.9400996
## ACF1
## Training set 0.02176343
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01336585 0.2016929 0.1531053 -0.3332051 4.387024 0.9415259
## ACF1
## Training set -0.03193856
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
##
## Model df: 2. Total lags used: 10
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,3)
## Q* = 10.706, df = 7, p-value = 0.152
##
## Model df: 3. Total lags used: 10
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 = 0.03376: log likelihood = 10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
##
## Model df: 1. Total lags used: 7
## Time Series:
## Start = 2016
## End = 2025
## Frequency = 1
## fc_austa_arima.0.1.1$upper.80% fc_austa_arima.0.1.1$upper.95%
## 2016 0.1138117 0.09101057
## 2017 0.2039418 0.22885273
## 2018 0.2527212 0.30345426
## 2019 0.2886706 0.35843413
## 2020 0.3180942 0.40343373
## 2021 0.3434784 0.44225553
## 2022 0.3660754 0.47681466
## 2023 0.3866116 0.50822207
## 2024 0.4055495 0.53718500
## 2025 0.4232034 0.56418442
## Time Series:
## Start = 2016
## End = 2025
## Frequency = 1
## fc_austa_arima.0.1.0$lower.80% fc_austa_arima.0.1.0$lower.95%
## 2016 -0.199956384 -0.22275751
## 2017 -0.109826244 -0.08491535
## 2018 -0.061046922 -0.01031382
## 2019 -0.025097519 0.04466605
## 2020 0.004326137 0.08966565
## 2021 0.029710347 0.12848745
## 2022 0.052307350 0.16304658
## 2023 0.072843552 0.19445399
## 2024 0.091781392 0.22341692
## 2025 0.109435361 0.25041634
## [1] 1
## [1] 1
## [1] "AIC for ARIMA(0,1,2)(0,1,1)[12]:"
## [1] -5081.506
## [1] "AIC for ARIMA(0,1,3)(0,1,1)[12]:"
## [1] -5080.441
## Series: usmelec
## ARIMA(0,1,2)(0,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ma1 ma2 sma1
## -0.4317 -0.2552 -0.8536
## s.e. 0.0439 0.0440 0.0261
##
## sigma^2 = 1.284e-06: log likelihood = 2544.75
## AIC=-5081.51 AICc=-5081.42 BIC=-5064.87
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(0,1,1)[12]
## Q* = 32.525, df = 21, p-value = 0.05176
##
## Model df: 3. Total lags used: 24
## Series: usmelec
## ARIMA(1,1,3)(2,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ar1 ma1 ma2 ma3 sar1 sar2 sma1
## 0.3952 -0.8194 -0.0468 0.0414 0.0403 -0.0934 -0.8462
## s.e. 0.3717 0.3734 0.1707 0.1079 0.0560 0.0533 0.0343
##
## sigma^2 = 1.278e-06: log likelihood = 2547.75
## AIC=-5079.5 AICc=-5079.19 BIC=-5046.23
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)(2,1,1)[12]
## Q* = 25.995, df = 17, p-value = 0.07454
##
## Model df: 7. Total lags used: 24
## [1] "usmelec time range:"
## [1] 1973 1 2013 6
## [1] "usmelec.new.ts time range:"
## [1] 1957 1 1988 9
## [1] "Initial forecast period for comparison (based on usmelec):"
## [1] 2013.583 2017.500
## [1] "Adjusted forecast period for comparison:"
## [1] 1984.750 1988.667
## [1] "usmelec.new.ts_next4years:"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1984 94 94 94
## 1985 94 93 93 93 93 93 93 93 93 93 93 93
## 1986 93 93 93 92 92 92 92 92 92 92 92 92
## 1987 92 92 92 92 92 93 93 93 93 93 93 93
## 1988 93 93 94 94 94 94 94 94 94
## [1] "fc_4years:"
## Jan Feb Mar Apr May Jun Jul Aug
## 2013 399.5328 397.6997
## 2014 352.0453 311.4639 318.5536 297.6065 326.0700 361.8061 403.0144 400.1553
## 2015 354.0714 313.1343 320.2843 299.1613 327.8655 363.9215 405.5219 402.6349
## 2016 356.1160 314.8189 322.0299 300.7289 329.6767 366.0565 408.0541 405.1387
## 2017 358.1792 316.5178 323.7905 302.3095 331.5036 368.2112 410.6113 407.6671
## 2018 360.2612 318.2311 325.5663 303.9032 333.3465 370.3859 413.1937 410.2204
## 2019 362.3624 319.9591 327.3575 305.5101 335.2057 372.5809 415.8018 412.7990
## 2020 364.4830 321.7018 329.1642 307.1305 337.0811 374.7965 418.4360 415.4032
## 2021 366.6231 323.4596 330.9867 308.7644 338.9732 377.0329 421.0964 418.0334
## 2022 368.7831 325.2326 332.8251 310.4120 340.8820 379.2903 423.7836 420.6898
## 2023 370.9631 327.0208 334.6796 312.0736 342.8078 381.5691 426.4979 423.3729
## 2024 373.1636 328.8246 336.5504 313.7491 344.7508 383.8696 429.2397 426.0831
## 2025 375.3846 330.6442 338.4378 315.4389 346.7111 386.1919 432.0093 428.8206
## 2026 377.6265 332.4796 340.3419 317.1430 348.6891 388.5364 434.8071 431.5859
## 2027 379.8896 334.3311 342.2628 318.8617 350.6848 390.9034 437.6335 434.3794
## 2028 382.1740 336.1988 344.2009 320.5951 352.6987 393.2931
## Sep Oct Nov Dec
## 2013 337.7733 313.7329 305.2066 341.9200
## 2014 339.6715 315.4225 306.8244 343.8551
## 2015 341.5865 317.1265 308.4558 345.8074
## 2016 343.5186 318.8450 310.1008 347.7772
## 2017 345.4679 320.5783 311.7597 349.7648
## 2018 347.4347 322.3264 313.4327 351.7703
## 2019 349.4192 324.0895 315.1197 353.7939
## 2020 351.4215 325.8679 316.8212 355.8360
## 2021 353.4420 327.6617 318.5371 357.8966
## 2022 355.4809 329.4711 320.2677 359.9761
## 2023 357.5383 331.2962 322.0131 362.0746
## 2024 359.6145 333.1374 323.7736 364.1925
## 2025 361.7097 334.9946 325.5492 366.3300
## 2026 363.8242 336.8682 327.3402 368.4872
## 2027 365.9582 338.7584 329.1468 370.6645
## 2028
## [1] "actual_4years:"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1984 94 94 94
## 1985 94 93 93 93 93 93 93 93 93 93 93 93
## 1986 93 93 93 92 92 92 92 92 92 92 92 92
## 1987 92 92 92 92 92 93 93 93 93 93 93 93
## 1988 93 93 94 94 94 94 94 94 94
## [1] "fc_nuclear_4years:"
## Jan Feb Mar Apr May Jun Jul Aug
## 1984
## 1985 93.71605 92.82689 93.83383 93.35217 92.49527 92.60022 92.23865 93.06886
## 1986 93.33975 92.43706 92.29195 93.27358 92.83494 91.20559 91.65051 92.58837
## 1987 92.38101 92.31492 91.91673 90.30469 91.66672 93.31348 93.10379 92.76173
## 1988 92.86112 92.85607 93.00636 94.60372 94.94623 95.28951 93.67963 92.92924
## Sep Oct Nov Dec
## 1984 94.57430 94.24037 93.61146
## 1985 93.63216 93.08133 92.52198 92.81689
## 1986 92.42198 91.64386 91.58136 92.18538
## 1987 92.86017 93.15340 93.14861 92.91310
## 1988 93.99541
## [1] "Accuracy of nuclear units model over last 4 years:"
## ME RMSE MAE MPE MAPE ACF1
## Test set 0.01936645 0.5894308 0.4644369 0.02097268 0.4997495 0.1636175
## Theil's U
## Test set 2.023133
## For nuclear units, forecasts are considered sufficiently accurate if MAPE < 10% or RMSE is low relative to scale (e.g., < 5 units).
## MAPE from last 4 years: 0.4997495