2.1. Use the help function to explore what the series gold, woolyrnq and gas represent.
The gold, woolyrnq, and gas datasets are available in the forecast
package in R. Do the following:
a. Use autoplot() to plot each of these in separate plots.
b. What is the frequency of each series? (Hint: Apply the frequency()
function.)
c. Use which.max() to spot the outlier in the gold series. Which
observation was it?
Approach:
For this exercise, we need to analyze three time series datasets: gold
(daily gold prices), woolyrnq (quarterly wool yarn production), and gas
(monthly Australian gas production). The goal is to visualize each
series, determine their frequencies, and identify an outlier in the gold
series. My approach is as follows:
Use str() to inspect the structure of each dataset for context.
Use autoplot() from the forecast package to create time series plots.
Apply frequency() to extract the number of observations per time unit.
Use which.max() to find the index of the maximum value in gold, assuming the outlier is the highest value, and retrieve that value.
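A minimal sketch of the commands behind the output below (the fpp2 package loads forecast together with these data sets):
library(fpp2)
str(gold); str(woolyrnq); str(gas)   # structure of each series
autoplot(gold)       # daily gold prices
autoplot(woolyrnq)   # quarterly wool yarn production
autoplot(gas)        # monthly Australian gas production
frequency(gold)      # observations per "year" as stored in each ts object
frequency(woolyrnq)
frequency(gas)
which.max(gold)        # index of the maximum (outlier) observation
gold[which.max(gold)]  # its value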
## Time-Series [1:1108] from 1 to 1108: 306 300 303 297 304 ...
## Time-Series [1:119] from 1965 to 1994: 6172 6709 6633 6660 6786 ...
## Time-Series [1:476] from 1956 to 1996: 1709 1646 1794 1878 2173 ...
## [1] "Frequency"
## [1] "gold"
## [1] 1
## [1] "woolyrnq"
## [1] 4
## [1] "gas"
## [1] 12
## [1] "When gold got maximum value?"
## [1] 770
## [1] "What was the gold's maximum value?"
## [1] 593.7
Interpretation:
Part a (Plots): The autoplot() function generates time series plots for each dataset. The gold plot likely shows daily fluctuations with a prominent spike (the outlier we’ll identify in part c). The woolyrnq plot displays quarterly data, possibly with seasonal peaks every year. The gas plot shows monthly production with a long-term upward trend and seasonal cycles, typical of energy data. These visualizations help us understand the data’s behavior over time.
Part b (Frequency): The frequency() function reveals the stored time intervals: gold has a frequency of 1 (the daily series is stored without a seasonal period), woolyrnq has a frequency of 4 (quarterly), and gas has a frequency of 12 (monthly). This confirms each dataset's temporal granularity, which matters for later time series analysis.
Part c (Outlier): Using which.max(), we find the maximum value in the gold series occurs at observation 770, with a value of 593.7. This spike is likely an outlier, since daily gold prices do not normally jump by that much from one observation to the next. Checking the plot, this point sits well above the neighbouring observations, suggesting a possible data error or an extreme one-day event.
2.3. Download some monthly Australian retail data from the
book website. These represent retail sales in various categories for
different Australian states, and are stored in a MS-Excel
file.
a. Read the data into R using the provided script.
b. Select one of the time series by extracting a column and converting
it to a time series object.
c. Explore your chosen retail time series using the functions
autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), and ggAcf().
Can you spot any seasonality, cyclicity, and trend? What do you learn
about the series?
Approach:
This exercise involves loading and analyzing a monthly retail sales
dataset from an Excel file. We’ll read the data, select a specific time
series (one column representing sales in a category and state), and use
visualization tools from the forecast package to examine its patterns.
My approach is:
Load the Excel file, skipping the first row since it contains an extra header.
Extract a column (e.g., “A3349873A”) and convert it to a time series object with monthly frequency starting April 1982.
Use five plotting functions to investigate trend (long-term direction), seasonality (recurring yearly patterns), cyclicity (longer-term waves), and autocorrelation (relationship between lagged observations).
Interpret the plots to understand the series’ behavior, which is foundational for forecasting.
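A minimal sketch of this workflow, using readxl::read_excel() (the interpretation below mentions xlsx::read.xlsx(), which can be used instead); the column "A3349873A" and the April 1982 start follow the steps listed above:
library(fpp2)
library(readxl)
retaildata <- read_excel("retail.xlsx", skip = 1)   # first row is an extra header
myts <- ts(retaildata[["A3349873A"]], frequency = 12, start = c(1982, 4))
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts, lags = 12)
ggAcf(myts)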
Plots (Descriptions):
autoplot(myts): A line plot showing retail turnover over time, likely with an upward trend and seasonal spikes.
ggseasonplot(myts): A plot with 12 lines (one per month) across years, showing higher sales in December (e.g., Christmas).
ggsubseriesplot(myts): 12 subplots (one per month), each with a mini time series, highlighting December peaks and a general upward trend.
gglagplot(myts, lags = 12): Scatterplots for lags 1–12, with a strong linear pattern at lag 12 (yearly seasonality).
ggAcf(myts): Bar chart of autocorrelation, with significant spikes at lags 12, 24, etc., and a gradual decline (trend effect).
Interpretation:
Part a (Data Loading):
The xlsx::read.xlsx() function successfully loads the “retail.xlsx” file, skipping the first row to align column names correctly. Viewing the data confirms it’s a table of monthly retail turnover across categories and states, with column “A3349873A” representing, for example, clothing sales in a specific state (you’d specify your chosen column).
Part b (Time Series Creation):
Converting column “A3349873A” to a ts object with frequency=12 and start=c(1982,4) defines it as monthly data starting April 1982. This prepares it for time series analysis.
Part c (Exploration):
autoplot: The plot likely shows a long-term upward trend (increasing retail sales over decades) with regular yearly peaks, suggesting both trend and seasonality. No clear cyclicity (multi-year waves) is apparent beyond seasonal effects.
ggseasonplot: Strong seasonality is evident, with December consistently higher (e.g., holiday shopping) and perhaps a dip in February (short month, post-holiday slump).
ggsubseriesplot: Each month’s subseries trends upward, with December showing the largest jumps, reinforcing seasonal and trend components.
gglagplot: A tight linear pattern at lag 12 confirms strong yearly seasonality—sales 12 months apart are highly correlated. Other lags show weaker patterns, indicating seasonality dominates over cyclicity.
ggAcf: Significant positive autocorrelation at lags 12, 24, etc., reflects yearly seasonality, while a slow decay from lag 1 suggests a trend. No negative spikes imply no strong mean-reverting cycles.
What I Learned: The series exhibits a clear upward trend (retail growth over time), strong seasonality (yearly cycles, especially holiday-driven), and no obvious cyclicity beyond seasonal effects. The autocorrelation confirms these patterns persist consistently, making this series a good candidate for seasonal forecasting models (e.g., ARIMA with seasonal components) later in the course.
6.2. The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
a. Plot the time series of sales of product A. Can you identify
seasonal fluctuations and/or a trend-cycle?
b. Use a classical multiplicative decomposition to calculate the
trend-cycle and seasonal indices.
c. Do the results support the graphical interpretation from part
(a)?
d. Compute and plot the seasonally adjusted data.
e. Change one observation to be an outlier (e.g., add 500 to one
observation), and recompute the seasonally adjusted data. What is the
effect of the outlier?
f. Does it make any difference if the outlier is near the end rather
than in the middle of the time series?
Approach:
This exercise involves analyzing the plastics time series to identify
its components (trend, seasonality, and irregular fluctuations) using
classical decomposition. We’ll visualize the data, decompose it
multiplicatively (since sales often grow proportionally), assess
seasonally adjusted data, and explore the impact of outliers at
different positions. My approach is:
Plot the raw data to visually inspect trend and seasonality.
Apply multiplicative decomposition to separate trend-cycle, seasonal indices, and irregular components.
Compare decomposition results to the initial plot.
Compute and plot seasonally adjusted data to isolate trend and irregular effects.
Introduce an outlier in the middle, then near the end, and assess its impact on decomposition outputs.
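A minimal sketch of the decomposition workflow described above (plastics ships with fpp2; the outlier positions 20 and 55 follow the discussion below):
library(fpp2)
autoplot(plastics)                                             # part (a)
decompose_plastics <- decompose(plastics, type = "multiplicative")
autoplot(decompose_plastics)                                   # part (b)
# part (d): seasonally adjusted data layered over the raw series and trend
autoplot(plastics, series = "Data") +
  autolayer(trendcycle(decompose_plastics), series = "Trend") +
  autolayer(seasadj(decompose_plastics), series = "Seasonally Adjusted")
# parts (e)/(f): add an outlier and recompute the seasonally adjusted series
plastics_out <- plastics
plastics_out[20] <- plastics_out[20] + 500                     # repeat with index 55 for part (f)
autoplot(seasadj(decompose(plastics_out, type = "multiplicative")))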
Part a - autoplot(plastics):
Part b - autoplot(decompose_plastics):
Trend: Smooth upward curve from ~900 to ~1500.
Seasonal: 12 monthly indices (e.g., July ~1.2, February ~0.8).
Random: Small fluctuations around 1, mostly noise.
Part d - Layered Plot:
Gray (Data): Raw sales with peaks.
Red (Trend): Smooth upward line.
Blue (Seasonally Adjusted): Trend-like but with minor wiggles (no seasonality).
Part e - Outlier at 20:
Data: Spike at observation 20 (e.g., ~1400 becomes ~1900).
Trend: Slightly elevated near 20 but smooths out.
Seasonally Adjusted: Large spike at 20, less smooth elsewhere.
Part f - Outlier at 55:
Data: Spikes at 20 and 55 (e.g., ~1450 becomes ~1950).
Trend: Minimal change near 55 (end effect).
Seasonally Adjusted: Spikes at 20 and 55, otherwise follows trend.
Interpretation:
Part a (Initial Plot): The autoplot(plastics) shows a clear upward trend—sales increase over the 5 years from about 900 to 1500 thousand units. Seasonal fluctuations are evident, with peaks every 12 months (likely summer demand for plastics), suggesting a strong yearly cycle. No obvious cyclicity (multi-year waves) beyond the trend and seasonality is visible.
Part b (Decomposition): The multiplicative decomposition (type = “multiplicative”) splits the series into trend-cycle (smooth growth), seasonal indices (consistent yearly pattern, e.g., high in July, low in February), and irregular components (minor noise). The plot confirms these components align with the raw data’s behavior.
Part c (Support for Part a): Yes, the decomposition supports part a’s interpretation. The trend-cycle matches the upward movement, and the seasonal component quantifies the yearly fluctuations seen in the plot (e.g., indices > 1 for peak months, < 1 for troughs).
Part d (Seasonally Adjusted): The seasonally adjusted data (seasadj()) removes the seasonal indices, leaving trend plus irregular components. The plot shows it closely follows the trend line but retains small random fluctuations, confirming seasonality was successfully stripped out.
Part e (Outlier Effect): Adding 500 to observation 20 (e.g., mid-year 2) creates a spike in the raw data. The trend-cycle adjusts slightly upward near this point but remains smooth due to averaging. The seasonally adjusted data, however, shows a dramatic spike at 20 because the irregular component absorbs the outlier’s full effect after seasonality is removed. This disrupts the smoothness, indicating sensitivity to outliers.
Part f (Outlier Position): Adding 500 to observation 55 (near the end, e.g., year 5) adds a second spike. The trend-cycle is less affected here because the centred moving average used in classical decomposition cannot be computed for the last six observations, so an end outlier influences fewer trend estimates. The seasonally adjusted data still spikes at 55, but the trend's stability suggests end outliers have a smaller impact on the estimated trend-cycle than middle outliers.
Insight: Multiplicative decomposition assumes seasonality grows with the trend, which fits this data (larger peaks in later years). Outliers distort the irregular component most, affecting seasonally adjusted data more than the trend, especially in the middle where the moving average has more data to smooth over. This highlights a limitation of classical methods—robust alternatives (e.g., STL decomposition) might handle outliers better.
Assignment 3 includes problems 3.1 and 3.2 from the KJ text.
The following R packages have been used for completion of all homework assignments to date:
#Textbook Packages
library(fpp2)
#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
#install.packages("mlbench")
library(mlbench)
#Processing
library(tidyverse)
#Graphing
library(ggplot2)
library(gridExtra)
#Math
library(caret)
library(randomForest)
library(seasonal)
library(psych)
library(corrplot)
#Formatting
library(knitr)
library(kableExtra)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Approach:
This exercise involves exploratory data analysis (EDA) of the Glass
dataset’s predictors to understand their distributions, relationships,
outliers, skewness, and potential transformations for classification
modeling. My approach is:
Load and inspect the data structure.
For (a): Use summary statistics, histograms, and correlation plots to examine distributions and relationships.
For (b): Use boxplots to detect outliers and compute skewness to assess asymmetry.
For (c): Apply a Box-Cox transformation to reduce skewness, then re-evaluate distributions and skewness to assess improvement.
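A minimal sketch of the EDA for parts (a) and (b), using packages already loaded above (mlbench, psych, corrplot); the object name glass_pred is illustrative:
library(mlbench)
library(psych)
library(corrplot)
data(Glass)
glass_pred <- Glass[, -10]                # drop the Type response
Glass_desc <- describe(glass_pred)        # summary statistics, including skew
Glass_desc
par(mfrow = c(3, 3))
for (v in names(glass_pred)) hist(glass_pred[[v]], main = v, xlab = v)   # distributions
for (v in names(glass_pred)) boxplot(glass_pred[[v]], main = v)          # outliers
par(mfrow = c(1, 1))
corrplot(cor(glass_pred))                 # pairwise correlations
Glass_desc[order(Glass_desc$skew), "skew", drop = FALSE]                 # skewness, sorted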
We can quickly evaluate the predictor variables using summary statistics.
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RI | 1 | 214 | 1.5183654 | 0.0030369 | 1.51768 | 1.5180119 | 0.0018755 | 1.51115 | 1.53393 | 0.02278 | 1.6027151 | 4.7167266 | 0.0002076 |
| Na | 2 | 214 | 13.4078505 | 0.8166036 | 13.30000 | 13.3768023 | 0.6449310 | 10.73000 | 17.38000 | 6.65000 | 0.4478343 | 2.8979666 | 0.0558219 |
| Mg | 3 | 214 | 2.6845327 | 1.4424078 | 3.48000 | 2.8655233 | 0.3039330 | 0.00000 | 4.49000 | 4.49000 | -1.1364523 | -0.4526762 | 0.0986010 |
| Al | 4 | 214 | 1.4449065 | 0.4992696 | 1.36000 | 1.4122093 | 0.3113460 | 0.29000 | 3.50000 | 3.21000 | 0.8946104 | 1.9383534 | 0.0341294 |
| Si | 5 | 214 | 72.6509346 | 0.7745458 | 72.79000 | 72.7073256 | 0.5708010 | 69.81000 | 75.41000 | 5.60000 | -0.7202392 | 2.8163627 | 0.0529469 |
| K | 6 | 214 | 0.4970561 | 0.6521918 | 0.55500 | 0.4318023 | 0.1704990 | 0.00000 | 6.21000 | 6.21000 | 6.4600889 | 52.8665268 | 0.0445829 |
| Ca | 7 | 214 | 8.9569626 | 1.4231535 | 8.60000 | 8.7421512 | 0.6597570 | 5.43000 | 16.19000 | 10.76000 | 2.0184463 | 6.4104000 | 0.0972848 |
| Ba | 8 | 214 | 0.1750467 | 0.4972193 | 0.00000 | 0.0337791 | 0.0000000 | 0.00000 | 3.15000 | 3.15000 | 3.3686800 | 12.0801412 | 0.0339892 |
| Fe | 9 | 214 | 0.0570093 | 0.0974387 | 0.00000 | 0.0358140 | 0.0000000 | 0.00000 | 0.51000 | 0.51000 | 1.7298107 | 2.5203615 | 0.0066608 |
The histograms below help us understand the distribution of each of our predictor variables. We can visualize the skewness from the summary statistics in these plots. Mg and Si are both skewed to the left, whereas the other variables have means greater than their medians, which skews their distributions to the right.
The plots for RI, Na, Al, and Si are relatively symmetric, suggesting normal distributions. The remainder show asymmetric distributions with either unimodal or bimodal peaks.
We can evaluate the relationship between our predictor variables through examination of their correlation with one another. The plot below indicates the strongest positive relationship between Ca and RI and strongest negative relationship between Si and RI.
We can use boxplots to indicate whether or not our data contain any outliers. All predictors except Mg show values beyond 1.5 times the interquartile range from the quartiles. The plot for Ba shows the largest number of outliers compared to the other predictor variables.
The small sample size (only 214 observations per variable) may exaggerate the skewness observed in the data. As noted earlier, Mg and Si are skewed left, whereas all other variables are skewed right. The degree of skewness is shown in the table below:
| Variable | Skew |
|---|---|
| Mg | -1.1364523 |
| Si | -0.7202392 |
| Na | 0.4478343 |
| Al | 0.8946104 |
| RI | 1.6027151 |
| Fe | 1.7298107 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| K | 6.4600889 |
We can apply a Box-Cox transformation to our entire dataset to reduce
skewness, using the forecast package. The Box-Cox method
uses maximum likelihood estimation to determine the transformation
parameter \(\lambda\). This can be used
to improve the overall classification model. There are additional
methods that could also be applied to reduce the influence of outliers (e.g., the spatial sign transformation)
if we find that our predictive models are not resistant to their effect.
However, given our small sample size, we need more information regarding
our sample to understand the outlying data points.
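A minimal sketch of one way to apply the transformation with the forecast package, estimating a separate \(\lambda\) per predictor; the exact implementation behind the table below is an assumption, and predictors containing zeros (e.g., Mg) may be left nearly unchanged:
library(mlbench)
library(forecast)
library(psych)
data(Glass)
glass_pred <- Glass[, -10]                # predictors only
glass_bc <- as.data.frame(
  lapply(glass_pred, function(x) BoxCox(x, BoxCox.lambda(x)))
)
data.frame(
  Variable      = names(glass_pred),
  `Pre-T Skew`  = round(describe(glass_pred)$skew, 4),
  `Post-T Skew` = round(describe(glass_bc)$skew, 4),
  check.names   = FALSE
)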
The table below shows the degree to which the skewness of each variable changed as a result of applying the Box-Cox transformation.
| Variable | Pre-T Skew | Post-T Skew | Change |
|---|---|---|---|
| RI | 1.6027151 | 1.5903260 | 0.0124 |
| Na | 0.4478343 | 0.0338633 | 0.4140 |
| Mg | -1.1364523 | -1.1364563 | 0.0000 |
| Al | 0.8946104 | -0.1505858 | 1.0452 |
| Si | -0.7202392 | -0.7888337 | 0.0686 |
| K | 6.4600889 | -1.9759297 | 8.4360 |
| Ca | 2.0184463 | 1.0515034 | 0.9669 |
| Ba | 3.3686800 | 1.7123349 | 1.6563 |
| Fe | 1.7298107 | 0.7593730 | 0.9704 |
Part a - Summary Statistics (Glass_desc):
Part a - Histograms:
RI: Near-normal, slight right skew.
Mg: Left-skewed (many high values, few low).
Ba: Right-skewed, spike at 0, long tail.
Part a - Correlation Plot:
Strongest positive: RI vs. Ca (~0.81).
Strongest negative: RI vs. Si (~-0.54).
Part b - Boxplots:
Ba: Many outliers above 1.5 (e.g., 3.15).
Mg: No outliers, tight range.
Part b - Skewness Table:
Part c - Transformed Histograms:
Ba: Less skewed, peak shifts left.
Mg: More symmetric.
Part c - Skewness Comparison:
Ba: Pre = 3.37, Post = 1.71, Change = 1.66.
Mg: Pre = -1.14, Post = -1.14, Change = 0.00 (essentially untransformed).
Interpretation:
Part a (Distributions and Relationships):
Summary Statistics: The describe() function reveals variability (e.g., Ba has a high sd relative to its mean, indicating spread). RI and Si have tight ranges, while Ba and Fe are sparse (many zeros).
Histograms: RI, Na, Al, and Si appear roughly symmetric, suggesting near-normal distributions, though RI has a slight right tail. Mg is left-skewed (high values dominate), while K, Ca, Ba, and Fe are right-skewed, with Ba showing a bimodal pattern (peak at 0, smaller peak > 1). This informs potential transformations in part (c).
Correlation: The corrplot highlights RI vs. Ca (positive, ~0.81) due to chemical similarity in glass composition, and RI vs. Si (negative, ~-0.54) as silicon reduces refractive index. Most other correlations are weak, suggesting limited multicollinearity.
Part b (Outliers and Skewness):
Outliers: Boxplots show outliers in all predictors except Mg. Ba has the most (e.g., values > 1.5), reflecting rare high concentrations. RI has few, given its tight range. Outliers may affect classification if not robustly handled.
Skewness: Mg (-1.14) and Si (-0.72) are left-skewed; the others are right-skewed (e.g., Ba 3.37, K 6.46). The small sample size (214) can exaggerate skewness, but these asymmetries could bias models like logistic regression unless addressed.
Part c (Transformations):
Box-Cox: The transformation reduces skewness by finding an optimal λ for each predictor (e.g., λ < 1 for right-skewed variables, λ > 1 for left-skewed ones). Post-transformation histograms show Ba, Ca, and Fe becoming more symmetric, while Mg is left essentially unchanged.
Skewness Reduction: Ba drops from 3.37 to 1.71 (change = 1.66), Ca from 2.02 to 1.05, and K from 6.46 to -1.98 (over-corrected into left skew). This normalization can improve classification models (e.g., LDA, SVM) that assume normality or are sensitive to scale.
Insight: Box-Cox is effective here, but outliers remain (transformations don’t remove them). For a small dataset, robust methods (e.g., spatial sign) or outlier removal might further enhance model performance, depending on their source (error vs. true variation).
Insight: The predictors’ chemical nature drives their distributions (e.g., Ba rarity causes skewness). Transformations align the data with statistical assumptions, but the small sample and class imbalance (7 categories) suggest caution—cross-validation will be key to confirm model improvement.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
Approach:
This exercise involves exploratory data analysis (EDA) of the Soybean
dataset’s categorical predictors and missing data patterns to inform
modeling strategies. My approach is:
For (a): Use bar plots to examine frequency distributions and identify degenerate predictors (e.g., near-uniform or highly imbalanced).
For (b): Quantify missing data per predictor and class, then analyze patterns linking missingness to specific classes.
For (c): Propose a missing data strategy based on the observed patterns, balancing imputation and model choice while considering the dataset’s structure.
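A minimal sketch of the EDA steps above (Soybean ships with mlbench; tidyverse is already loaded):
library(mlbench)
library(tidyverse)
data(Soybean)
# (a) frequency distribution of each categorical predictor
Soybean %>%
  select(-Class) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "level") %>%
  ggplot(aes(level)) +
  geom_bar() +
  facet_wrap(~ predictor, scales = "free_x")
# (b) missing values per predictor, and proportion missing within each class
sort(colSums(is.na(Soybean)), decreasing = TRUE)
Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x))))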
The bar plots below help us identify categorical predictors with degenerate distributions. A number of variables have what the text describes as a "handful of unique values that occur with very low frequencies." These occurrences can be problematic for certain types of models (e.g., linear regression), as they affect the calculations within the model. In some instances, such as ext.decay, we could convert the 3 factor levels to a binary indicator to offset the observed imbalance. Other predictors, like sclerotia and mycelium, could be dropped entirely, as these factors are overwhelmingly dominated by a single level, with only a few other or missing observations.
Because missing data is limited to only five classes, it is important for us to understand why the data is missing. If the missing observations are due to structural reasons, the NA values should be retained. For example, these specific classes might not bear fruit, which would account for NA observations for fruit bodies, pods, and seeds. Imputing these observations, especially where 100% of the data is unaccounted for, could severely bias our model without understanding the rationale behind this clear pattern of NA values. Due to the size of our data, elimination of predictors would not be ideal. Each variable has less than 18% of its data missing, which could be handled by a KNN or mode imputation approach. However, I would only recommend that type of imputation for the date and area.dam variables, as they are only missing up to 6.25% in individual classes.
The other variables contain much higher proportions of missing data per class, ranging from 40 to 100 percent. For this reason, I would recommend using a tree-based modeling approach, as these models are able to handle missing data, unlike many other types of regression models.
Interpretation:
Part a (Frequency Distributions): As discussed above, several predictors are degenerate: mycelium and sclerotia are dominated by a single level, and others (e.g., ext.decay) are highly imbalanced across their levels.
Part b (Missing Data):
Predictors: Missing values are spread across most predictors, with the worst-affected columns missing up to roughly 18% of their values. Class and leaves have no NAs, and date and area.dam have almost none. Predictors describing plant condition, such as fruiting.bodies, seed.size, and fruit.pods, are among the most affected, while environmental variables (e.g., temp) have fewer NAs.
Class Patterns: Missingness is confined to 5 of the 19 classes (e.g., phytophthora-rot, 2-4-d-injury), which together account for roughly 18% of the samples. Within these classes, some predictors are 100% NA (for example, most of the plant-condition variables in 2-4-d-injury), suggesting structural missingness (disease-specific traits absent or simply not recorded). This non-random pattern implies missingness is tied to class biology, not random error.
Part c (Strategy):
Analysis: Eliminating predictors risks losing valuable signal, as even high-NA variables (e.g., fruiting.bodies) vary across classes. Imputation (e.g., KNN, mode) suits low-NA predictors like date (up to 6% per class), but for predictors with 40–100% NA in affected classes (e.g., fruit.pods), imputation could introduce severe bias—imputing zeros for absent fruit in non-fruiting diseases misrepresents reality.
Recommendation: Retain NAs where structurally meaningful (e.g., no fruit in certain diseases) and use tree-based models such as CART (rpart), which handle missing data natively via surrogate splits (a random forest would first require imputation, e.g., rfImpute). This avoids imputation bias and leverages the dataset's full structure. For small-NA predictors, mode imputation could supplement if needed, but the tree approach is robust given the 683-sample size and categorical nature.
Insight: The degeneracy and missingness reflect soybean biology—rare conditions (e.g., mycelium) and disease-specific absences (e.g., fruit.pods) are informative, not noise. Tree models align with this complexity, preserving interpretability over forced imputation. Cross-validation will confirm if retaining NAs outperforms elimination or imputation.
Consider the pigs series - the number of pigs slaughtered in Victoria each month.
Use the ses function in R to find the optimal values of alpha and l0, and generate forecasts for the next four months.
Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.
Approach:
This exercise involves applying Simple Exponential Smoothing (SES) to
the pigs time series to forecast future values and assess uncertainty.
My approach is:
Inspect the pigs data structure and initial values.
Fit an SES model, optimizing \(\alpha\) and \(l_0\), and forecast 4 months ahead.
Extract model details and prediction intervals from R’s output.
Manually compute a 95% prediction interval using the residual standard deviation and compare it to R’s interval.
Visualize the data, fitted values, and forecasts to interpret the model’s performance.
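A minimal sketch of the SES workflow above (pigs ships with fpp2):
library(fpp2)
ses_pigs <- ses(pigs, h = 4)         # alpha and l0 estimated by the function
summary(ses_pigs)
# manual 95% interval for the first forecast, using the residual standard deviation
s   <- sd(residuals(ses_pigs))
fc1 <- ses_pigs$mean[1]
c(lower = fc1 - 1.96 * s, upper = fc1 + 1.96 * s)
# R's own 95% interval, for comparison
c(lower = ses_pigs$lower[1, "95%"], upper = ses_pigs$upper[1, "95%"])
autoplot(ses_pigs) + autolayer(fitted(ses_pigs), series = "Fitted")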
## Time-Series [1:188] from 1980 to 1996: 76378 71947 33873 96428 105084 ...
## Jan Feb Mar Apr May Jun
## 1980 76378 71947 33873 96428 105084 95741
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
## 95%
## 119020.8
## 95%
## 78611.97
## [1] 118952.8
## [1] 78679.97
Interpretation:
Data Overview: The pigs series tracks the monthly number of pigs slaughtered in Victoria over roughly 16 years (1980–1995). str() confirms it's a time series with frequency 12, and head() shows typical values of roughly 70,000–105,000 early on, suggesting possible trends or seasonality to explore.
SES Model: The ses() function optimizes \(\alpha = 0.2971\), indicating moderate smoothing (closer to 0 than 1, so past data retains influence), and \(l_0 = 77260.06\), the estimated level at time 0. It forecasts 4 months ahead (about 98,816 for each month), assuming no trend or seasonality (an SES limitation).
Plot: The autoplot() shows historical data with possible upward trend and fluctuations, overlaid with fitted values (a smoothed curve reflecting SES’s level-only fit) and flat forecasts. The 95% PI fans out slightly, reflecting increasing uncertainty, though SES assumes constant variance, limiting its realism if trends exist.
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter alpha) and level (the initial level l0). It should return the forecast of the next observation in the series. Does it give the same forecast as ses?
Approach:
I need to create a custom SES function that mimics the mechanics of R’s
ses() function, then test it against ses() using two datasets. Here’s
how I’ll tackle it:
For the function, I’ll use the SES formula.
I'll extract alpha and l0 from ses() fits on pigs (from 7.2 above) and ausbeer (quarterly beer production), run my function, and compare the forecasts.
I’ll interpret whether my function matches ses() and reflect on any differences.
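A minimal sketch of the custom function described above; the name my_ses is illustrative:
library(fpp2)
my_ses <- function(y, alpha, level) {
  for (t in seq_along(y)) {
    level <- alpha * y[t] + (1 - alpha) * level   # SES level recursion
  }
  level                                           # forecast of the next observation
}
ses_pigs <- ses(pigs, h = 4)
my_ses(pigs,
       alpha = ses_pigs$model$par["alpha"],
       level = ses_pigs$model$par["l"])           # should equal ses_pigs$mean[1]
# repeat with ausbeer to mirror the second comparison below
ses_beer <- ses(ausbeer, h = 4)
my_ses(ausbeer, ses_beer$model$par["alpha"], ses_beer$model$par["l"])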
## Forecast of next observation by SES function:
## 98816.4061115907
## alpha
## 98816.41
## Forecast of next observation by ses function: 98816.4061115907
## Forecast of next observation by SES function:
## 421.313649201742
## alpha
## 421.3136
## Forecast of next observation by ses function: 421.313649201742
Discussion of Results:
Function Design: I wrote my SES function to take a time series y, smoothing parameter alpha, and initial level l0. It starts the level at \(l_0\) and iteratively updates it using \(l_t = \alpha y_t + (1 - \alpha) l_{t-1}\). After processing all observations, the final smoothed level becomes the forecast for the next period, which I print using cat(). This matches SES's core logic: the forecast is the final smoothed level.
Pigs Test: For pigs, I used \(\alpha = 0.2971\) and \(l_0 = 77260.06\) from ses_pigs (fitted in 7.2 above). My function output 98816.41, exactly matching ses_pigs$mean[1]. This confirms my implementation mirrors R's SES for this dataset, which has minimal trend or seasonality to exploit (SES assumes a flat level).
Ausbeer Test: For ausbeer, I fitted ses() and got \(\alpha \approx 0.1489\) and \(l_0 \approx 259.66\). My function returned 421.3136, again identical to ses_ausbeer$mean[1]. This consistency holds despite ausbeer's known seasonality, as SES ignores it and focuses on the level.
Comparison Insight: My function gives the same forecast as ses() in both cases. Why? Because I replicated SES's recursive formula exactly, using the same \(\alpha\) and \(l_0\) optimized by ses(). The ses() function minimizes squared errors to find these parameters, and my loop applies the same smoothing logic, so both outputs align to the full printed precision.
Reflection: I found that my SES works just like ses() for point forecasts. However, ses() also provides prediction intervals and fitted values, which mine doesn’t (yet). If I wanted to extend it, I’d need to store all levels and calculate residual variance—something to consider for future tweaks. For now, I’m confident my function nails the forecast part.
I learned SES is straightforward to code manually, but its simplicity limits it. For pigs or ausbeer, a visual check (e.g., autoplot(pigs)) might reveal trends or seasonality SES misses. My function’s success here validates the math, but real-world forecasting might need Holt’s or Holt-Winters methods instead.
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim function to find the optimal values of alpha and l0. Do you get the same values as the ses function?
Approach:
I need to tweak my SES function to compute SSE instead of a forecast,
then use optim() to minimize SSE and find the best \(\alpha\) and \(l_0\). Here's
how I’ll do it:
I’ll modify SES to take a parameter vector (pars = c(alpha, l0)) and the time series y, calculate errors as I smooth the series, and return SSE.
I'll use optim() with initial guesses (\(\alpha = 0.5\), \(l_0 = y[1]\)) to minimize SSE for pigs and ausbeer.
I’ll compare my optimized parameters to those from ses_pigs and ses_ausbeer (from prior exercises) and figure out why differences might occur.
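A minimal sketch of the SSE version and the optim() call described above; the function name SES_SSE and the starting values are illustrative:
SES_SSE <- function(pars, y) {
  alpha <- pars[1]
  level <- pars[2]
  sse   <- 0
  for (t in seq_along(y)) {
    error <- y[t] - level                         # one-step-ahead forecast error
    sse   <- sse + error^2
    level <- alpha * y[t] + (1 - alpha) * level   # update the level
  }
  sse
}
opt_pigs <- optim(par = c(0.5, pigs[1]), fn = SES_SSE, y = pigs)
opt_pigs$par   # compare with ses(pigs)$model$par[c("alpha", "l")]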
## Optimal parameters for the result of SES function:
## 0.299008094014243, 76379.2653476235
## Parameters got from the result of ses function:
## 0.297148833372095, 77260.0561458528
## Optimal parameters for the result of SES function:
## 0.148803623284766, 259.658459712619
## Parameters got from the result of ses function:
## 0.148900381680056, 259.65918938777
## Time-Series [1:30, 1:2] from 1 to 30: 199 172 111 209 161 119 195 195 131 183 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "Paperback" "Hardcover"
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## Paperback Hardcover
## 1 199 139
## 2 172 128
## 3 111 172
## 4 209 139
## 5 161 191
## 6 119 168
## [1] 33.63769
## [1] 31.93101
## [1] 31.13692
## [1] 27.19358
## 95% PI of paperback sales calculated by holt function
## 95%
## 275.0205
## 95%
## 143.913
## 95% PI of paperback sales calculated by formula
## [1] 270.4951
## [1] 148.4384
## 95% PI of hardcover sales calculated by holt function
## 95%
## 307.4256
## 95%
## 192.9222
## 95% PI of hardcover sales calculated by formula
## [1] 303.4733
## [1] 196.8745
## Time-Series [1:113] from 1977 to 2005: 330 371 271 344 358 ...
## Qtr1 Qtr2 Qtr3 Qtr4
## 1977 330.371 371.051 270.670 343.880
## 1978 358.491 362.822
## ETS(A,N,A)
##
## Call:
## ets(y = ukcars)
##
## Smoothing parameters:
## alpha = 0.6199
## gamma = 1e-04
##
## Initial states:
## l = 314.2568
## s = -1.7579 -44.9601 21.1956 25.5223
##
## sigma: 25.9302
##
## AIC AICc BIC
## 1277.752 1278.819 1296.844
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
## ACF1
## Training set 0.02573334
## [1] "Accuracy of STL + ETS(A, Ad, N) model"
## ME RMSE MAE MPE MAPE MASE
## Training set 1.040088 22.81499 17.73846 -0.1296296 5.821059 0.5780878
## ACF1
## Training set 0.01590553
## [1] "Accuracy of STL + ETS(A, A, N) model"
## ME RMSE MAE MPE MAPE MASE
## Training set -0.4024211 22.86408 17.732 -0.5948215 5.836029 0.5778774
## ACF1
## Training set 0.02416541
## [1] "Accuracy of ETS(A, N, A) model"
## ME RMSE MAE MPE MAPE MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
## ACF1
## Training set 0.02573334
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,Ad,N)
## Q* = 23.825, df = 8, p-value = 0.002451
##
## Model df: 0. Total lags used: 8
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
a. Explain the differences among these figures. Do they all indicate that the data are white noise?
All three plots indicate white noise, but with different visual characteristics:
Small sample (n = 36): larger random fluctuations in the autocorrelations.
Medium sample (n = 360): moderate random fluctuations.
Large sample (n = 1000): smaller random fluctuations, closer to zero.
Critical values: The boundaries are at ±1.96/√n, so they’re wider for smaller samples (36) and narrower for larger samples (1000).
Different autocorrelations: Due to sampling variability - smaller samples have higher variance in their autocorrelation estimates, while larger samples produce more precise estimates closer to zero.
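A quick check of the ±1.96/√n bounds mentioned above:
n <- c(36, 360, 1000)
round(1.96 / sqrt(n), 3)   # approximately 0.327, 0.103, 0.062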
A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
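A minimal sketch of the plots requested above (ibmclose is loaded with fpp2 via the fma package):
library(fpp2)
autoplot(ibmclose)   # the level wanders with no fixed mean: non-stationary
ggAcf(ibmclose)      # ACF starts near 1 and decays very slowly: differencing needed
ggPacf(ibmclose)     # PACF has a single dominant spike at lag 1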
Consider the number of women murdered each year (per 100,000 standard population) in the United States. (Data set wmurders).
a. By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model.
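A minimal sketch of the identification steps behind the output below (wmurders ships with fpp2; kpss.test() is from the tseries package, matching the test output shown):
library(fpp2)
library(tseries)
autoplot(wmurders)                             # trending, non-seasonal annual series
ndiffs(wmurders)                               # suggests d = 2
kpss.test(diff(wmurders, differences = 2))     # twice-differenced series looks stationary
ggAcf(diff(wmurders, differences = 2))         # ACF/PACF point to an MA(2) term
ggPacf(diff(wmurders, differences = 2))
fit <- Arima(wmurders, order = c(0, 2, 2))
checkresiduals(fit)                            # Ljung-Box test on the residuals
forecast(fit, h = 3)                           # forecasts for 2005-2007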
## [1] 2
##
## KPSS Test for Level Stationarity
##
## data: diff(wmurders, differences = 2)
## KPSS Level = 0.045793, Truncation lag parameter = 3, p-value = 0.1
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
##
## Model df: 2. Total lags used: 10
## Time Series:
## Start = 2005
## End = 2007
## Frequency = 1
## [1] 2.480525 2.374890 2.269256
## Series: wmurders
## ARIMA(0,2,2)
##
## Coefficients:
## ma1 ma2
## -1.0181 0.1470
## s.e. 0.1220 0.1156
##
## sigma^2 = 0.04702: log likelihood = 6.03
## AIC=-6.06 AICc=-5.57 BIC=-0.15
## [1] 2.480523 2.374887 2.269252
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0113461 0.2088162 0.1525773 -0.2403396 4.331729 0.9382785
## ACF1
## Training set -0.05094066
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01065956 0.2072523 0.1528734 -0.2149476 4.335214 0.9400996
## ACF1
## Training set 0.02176343
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01336585 0.2016929 0.1531053 -0.3332051 4.387024 0.9415259
## ACF1
## Training set -0.03193856
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
##
## Model df: 2. Total lags used: 10
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,3)
## Q* = 10.706, df = 7, p-value = 0.152
##
## Model df: 3. Total lags used: 10
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 = 0.03376: log likelihood = 10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
##
## Model df: 1. Total lags used: 7
## Time Series:
## Start = 2016
## End = 2025
## Frequency = 1
## fc_austa_arima.0.1.1$upper.80% fc_austa_arima.0.1.1$upper.95%
## 2016 0.1138117 0.09101057
## 2017 0.2039418 0.22885273
## 2018 0.2527212 0.30345426
## 2019 0.2886706 0.35843413
## 2020 0.3180942 0.40343373
## 2021 0.3434784 0.44225553
## 2022 0.3660754 0.47681466
## 2023 0.3866116 0.50822207
## 2024 0.4055495 0.53718500
## 2025 0.4232034 0.56418442
## Time Series:
## Start = 2016
## End = 2025
## Frequency = 1
## fc_austa_arima.0.1.0$lower.80% fc_austa_arima.0.1.0$lower.95%
## 2016 -0.199956384 -0.22275751
## 2017 -0.109826244 -0.08491535
## 2018 -0.061046922 -0.01031382
## 2019 -0.025097519 0.04466605
## 2020 0.004326137 0.08966565
## 2021 0.029710347 0.12848745
## 2022 0.052307350 0.16304658
## 2023 0.072843552 0.19445399
## 2024 0.091781392 0.22341692
## 2025 0.109435361 0.25041634
## [1] 1
## [1] 1
## [1] "AIC for ARIMA(0,1,2)(0,1,1)[12]:"
## [1] -5081.506
## [1] "AIC for ARIMA(0,1,3)(0,1,1)[12]:"
## [1] -5080.441
## Series: usmelec
## ARIMA(0,1,2)(0,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ma1 ma2 sma1
## -0.4317 -0.2552 -0.8536
## s.e. 0.0439 0.0440 0.0261
##
## sigma^2 = 1.284e-06: log likelihood = 2544.75
## AIC=-5081.51 AICc=-5081.42 BIC=-5064.87
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(0,1,1)[12]
## Q* = 32.525, df = 21, p-value = 0.05176
##
## Model df: 3. Total lags used: 24
## Series: usmelec
## ARIMA(1,1,3)(2,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ar1 ma1 ma2 ma3 sar1 sar2 sma1
## 0.3952 -0.8194 -0.0468 0.0414 0.0403 -0.0934 -0.8462
## s.e. 0.3717 0.3734 0.1707 0.1079 0.0560 0.0533 0.0343
##
## sigma^2 = 1.278e-06: log likelihood = 2547.75
## AIC=-5079.5 AICc=-5079.19 BIC=-5046.23
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)(2,1,1)[12]
## Q* = 25.995, df = 17, p-value = 0.07454
##
## Model df: 7. Total lags used: 24
## [1] "usmelec time range:"
## [1] 1973 1 2013 6
## [1] "usmelec.new.ts time range:"
## [1] 1957 1 1988 9
## [1] "Initial forecast period for comparison (based on usmelec):"
## [1] 2013.583 2017.500
## [1] "Adjusted forecast period for comparison:"
## [1] 1984.750 1988.667
## [1] "usmelec.new.ts_next4years:"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1984 94 94 94
## 1985 94 93 93 93 93 93 93 93 93 93 93 93
## 1986 93 93 93 92 92 92 92 92 92 92 92 92
## 1987 92 92 92 92 92 93 93 93 93 93 93 93
## 1988 93 93 94 94 94 94 94 94 94
## [1] "fc_4years:"
## Jan Feb Mar Apr May Jun Jul Aug
## 2013 399.5328 397.6997
## 2014 352.0453 311.4639 318.5536 297.6065 326.0700 361.8061 403.0144 400.1553
## 2015 354.0714 313.1343 320.2843 299.1613 327.8655 363.9215 405.5219 402.6349
## 2016 356.1160 314.8189 322.0299 300.7289 329.6767 366.0565 408.0541 405.1387
## 2017 358.1792 316.5178 323.7905 302.3095 331.5036 368.2112 410.6113 407.6671
## 2018 360.2612 318.2311 325.5663 303.9032 333.3465 370.3859 413.1937 410.2204
## 2019 362.3624 319.9591 327.3575 305.5101 335.2057 372.5809 415.8018 412.7990
## 2020 364.4830 321.7018 329.1642 307.1305 337.0811 374.7965 418.4360 415.4032
## 2021 366.6231 323.4596 330.9867 308.7644 338.9732 377.0329 421.0964 418.0334
## 2022 368.7831 325.2326 332.8251 310.4120 340.8820 379.2903 423.7836 420.6898
## 2023 370.9631 327.0208 334.6796 312.0736 342.8078 381.5691 426.4979 423.3729
## 2024 373.1636 328.8246 336.5504 313.7491 344.7508 383.8696 429.2397 426.0831
## 2025 375.3846 330.6442 338.4378 315.4389 346.7111 386.1919 432.0093 428.8206
## 2026 377.6265 332.4796 340.3419 317.1430 348.6891 388.5364 434.8071 431.5859
## 2027 379.8896 334.3311 342.2628 318.8617 350.6848 390.9034 437.6335 434.3794
## 2028 382.1740 336.1988 344.2009 320.5951 352.6987 393.2931
## Sep Oct Nov Dec
## 2013 337.7733 313.7329 305.2066 341.9200
## 2014 339.6715 315.4225 306.8244 343.8551
## 2015 341.5865 317.1265 308.4558 345.8074
## 2016 343.5186 318.8450 310.1008 347.7772
## 2017 345.4679 320.5783 311.7597 349.7648
## 2018 347.4347 322.3264 313.4327 351.7703
## 2019 349.4192 324.0895 315.1197 353.7939
## 2020 351.4215 325.8679 316.8212 355.8360
## 2021 353.4420 327.6617 318.5371 357.8966
## 2022 355.4809 329.4711 320.2677 359.9761
## 2023 357.5383 331.2962 322.0131 362.0746
## 2024 359.6145 333.1374 323.7736 364.1925
## 2025 361.7097 334.9946 325.5492 366.3300
## 2026 363.8242 336.8682 327.3402 368.4872
## 2027 365.9582 338.7584 329.1468 370.6645
## 2028
## [1] "actual_4years:"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1984 94 94 94
## 1985 94 93 93 93 93 93 93 93 93 93 93 93
## 1986 93 93 93 92 92 92 92 92 92 92 92 92
## 1987 92 92 92 92 92 93 93 93 93 93 93 93
## 1988 93 93 94 94 94 94 94 94 94
## [1] "fc_nuclear_4years:"
## Jan Feb Mar Apr May Jun Jul Aug
## 1984
## 1985 93.71605 92.82689 93.83383 93.35217 92.49527 92.60022 92.23865 93.06886
## 1986 93.33975 92.43706 92.29195 93.27358 92.83494 91.20559 91.65051 92.58837
## 1987 92.38101 92.31492 91.91673 90.30469 91.66672 93.31348 93.10379 92.76173
## 1988 92.86112 92.85607 93.00636 94.60372 94.94623 95.28951 93.67963 92.92924
## Sep Oct Nov Dec
## 1984 94.57430 94.24037 93.61146
## 1985 93.63216 93.08133 92.52198 92.81689
## 1986 92.42198 91.64386 91.58136 92.18538
## 1987 92.86017 93.15340 93.14861 92.91310
## 1988 93.99541
## [1] "Accuracy of nuclear units model over last 4 years:"
## ME RMSE MAE MPE MAPE ACF1
## Test set 0.01936645 0.5894308 0.4644369 0.02097268 0.4997495 0.1636175
## Theil's U
## Test set 2.023133
## For nuclear units, forecasts are considered sufficiently accurate if MAPE < 10% or RMSE is low relative to scale (e.g., < 5 units).
## MAPE from last 4 years: 0.4997495