library(fpp2)
library(seasonal)
library(tidyverse)
library(reshape2)
library(corrplot)

HA 2.1

Use the help function to explore what the series gold, woolyrnq and gas represent.

help(gold)
help(woolyrnq)
help(gas)

Part a

Use autoplot() to plot each of these in separate plots.

autoplot(gold)

autoplot(woolyrnq)

autoplot(gas)

Part b

What is the frequency of each series? Hint: apply the frequency() function.

frequency(gold)
## [1] 1
frequency(woolyrnq)
## [1] 4
frequency(gas)
## [1] 12

Part c

Use which.max() to spot the outlier in the gold series. Which observation was it?

which.max(gold)
## [1] 770
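
The outlier is observation 770. Its value can be inspected directly (a quick sketch, output not shown):

gold[which.max(gold)]  # price at the outlying observation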

HA 2.3

Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.

Part a

You can read the data into R with the following script:

retaildata <- readxl::read_excel("retail.xlsx", skip=1)
#summary(retaildata)

Part b

Select one of the time series as follows (but replace the column name with your own chosen column):

myts <- ts(retaildata[,"A3349788J"],
  frequency=12, start=c(1982,4))

Part c

Explore your chosen retail time series using the following functions:

autoplot(myts)

ggseasonplot(myts)

ggsubseriesplot(myts)

gglagplot(myts)

ggAcf(myts)

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

It is clearly seasonal: sales peak in December after a steady rise starting in October, then drop steeply in January. Because this pattern repeats at a fixed annual period, it is seasonality rather than true cyclicity.

HA 6.2

The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

Part a

Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?

plastics
##    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1  742  697  776  898 1030 1107 1165 1216 1208 1131  971  783
## 2  741  700  774  932 1099 1223 1290 1349 1341 1296 1066  901
## 3  896  793  885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4  951  861  938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013
autoplot(plastics)

Sales are markedly higher in the summer and early fall months (May to October), and the printed table shows the overall level rising from year to year as well.

Part b

Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

plastics %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition
    of plastics")
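
The trend-cycle and seasonal indices can also be printed rather than only plotted; a small sketch, storing the same multiplicative fit:

fit <- decompose(plastics, type = "multiplicative")
round(fit$figure, 3)  # the twelve monthly seasonal indices
fit$trend             # the centred moving-average trend-cycle estimate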

Part c

Do the results support the graphical interpretation from part a? Yes: the seasonal component peaks in mid-year, matching part a, and the trend-cycle component rises steadily over most of the series.

Part d

Compute and plot the seasonally adjusted data.

Yet to finish
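
A minimal sketch of one way to do this, reusing the fit object from part b and seasadj() from the forecast package (loaded via fpp2):

autoplot(plastics, series = "Data") +
  autolayer(seasadj(fit), series = "Seasonally adjusted") +
  xlab("Year") + ylab("Sales (thousands)")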

Part e

Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

Yet to finish
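
A sketch of one way to proceed: bump a single mid-series observation by 500, refit, and replot the seasonally adjusted series (month 30 is an arbitrary choice):

plastics_mid <- plastics
plastics_mid[30] <- plastics_mid[30] + 500  # inject the outlier mid-series

fit_mid <- decompose(plastics_mid, type = "multiplicative")
autoplot(seasadj(fit_mid))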

Part f

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Yet to finish
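
The sketch from part e extends naturally here: place the same spike near the end instead (month 58, again arbitrary) and compare the two seasonally adjusted series:

plastics_end <- plastics
plastics_end[58] <- plastics_end[58] + 500  # inject the outlier near the end

autoplot(seasadj(decompose(plastics_mid, type = "multiplicative")),
         series = "Outlier in middle") +
  autolayer(seasadj(decompose(plastics_end, type = "multiplicative")),
            series = "Outlier near end")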

KJ 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
## Warning: package 'mlbench' was built under R version 3.6.3
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Part a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
glass_long <- melt(Glass, id.vars = "Type")

ggplot(data = glass_long, aes(x = value)) +
  stat_density() +
  facet_wrap(~variable, scales = "free")

corrplot.mixed(cor(Glass[,-10]))

Part b

Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, there are a great many outliers, most notably in Ba, Ca, and Fe. Several predictors are also skewed: Ba and Fe are strongly right-skewed (their medians are 0 while their means are positive), and K and Ca are right-skewed as well.

ggplot(data = glass_long, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free")

Part c

Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes, see below:

melt.glass <- melt(Glass, id.vars = "Type")

# Ca: compare the original distribution with a squared transformation
melt.glass.Ca.old <- melt.glass %>%
  filter(variable == "Ca")
melt.glass.Ca.old$age <- "Old"

melt.glass.Ca.new <- melt.glass.Ca.old
melt.glass.Ca.new$age <- "New"
melt.glass.Ca.new$value <- melt.glass.Ca.old$value^2

melt.glass.Ca <- rbind(melt.glass.Ca.new, melt.glass.Ca.old)

ggplot(data = melt.glass.Ca, aes(x = value)) +
  stat_density() +
  facet_wrap(~age)

ggplot(data = melt.glass.Ca, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~age)

# Ba: compare the original distribution with a log transformation
melt.glass.Ba.old <- melt.glass %>%
  filter(variable == "Ba")
melt.glass.Ba.old$age <- "Old"

melt.glass.Ba.new <- melt.glass.Ba.old
melt.glass.Ba.new$age <- "New"
# log(0) = -Inf, so the many zero measurements are dropped in the plots below
melt.glass.Ba.new$value <- log(melt.glass.Ba.old$value)

melt.glass.Ba <- rbind(melt.glass.Ba.new, melt.glass.Ba.old)

ggplot(data = melt.glass.Ba, aes(x = value)) +
  stat_density() +
  facet_wrap(~age)
## Warning: Removed 176 rows containing non-finite values (stat_density).

ggplot(data = melt.glass.Ba, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~age)
## Warning: Removed 176 rows containing non-finite values (stat_boxplot).

# Fe: compare the original distribution with a log transformation
melt.glass.Fe.old <- melt.glass %>%
  filter(variable == "Fe")
melt.glass.Fe.old$age <- "Old"

melt.glass.Fe.new <- melt.glass.Fe.old
melt.glass.Fe.new$age <- "New"
# as with Ba, log(0) = -Inf, so the zero measurements are dropped below
melt.glass.Fe.new$value <- log(melt.glass.Fe.old$value)

melt.glass.Fe <- rbind(melt.glass.Fe.new, melt.glass.Fe.old)

ggplot(data = melt.glass.Fe, aes(x = value)) +
  stat_density() +
  facet_wrap(~age)
## Warning: Removed 144 rows containing non-finite values (stat_density).

ggplot(data = melt.glass.Fe, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~age)
## Warning: Removed 144 rows containing non-finite values (stat_boxplot).
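
One caveat on the log transformations above: most Ba and Fe measurements are exactly zero, so log() sends them to -Inf and ggplot drops them (the warnings above). A possible alternative, sketched here, is log1p(), which keeps the zeros:

# log(1 + x) keeps the zero measurements instead of dropping them as -Inf
melt.glass %>%
  filter(variable %in% c("Ba", "Fe")) %>%
  mutate(value = log1p(value)) %>%
  ggplot(aes(x = value)) +
  stat_density() +
  facet_wrap(~variable, scales = "free")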

KJ 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. http://archive.ics.uci.edu/ml/index.html.

library(mlbench)
data(Soybean)

Part a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# drop the ordering on the ordered-factor predictors so that the bar
# charts below treat every predictor as plain categorical
for (i in 3:ncol(Soybean)) {
  Soybean[, i] <- factor(Soybean[, i], ordered = FALSE)
}

Soybean %>%
  pivot_longer(cols = everything(), names_to = "Measure", values_to = "Value") %>%
  ggplot(aes(x = Value)) + geom_bar() + facet_wrap(~Measure, scales = "free")
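
The bar charts can be backed up programmatically; a sketch, assuming the caret package is installed, whose nearZeroVar() flags predictors with degenerate distributions in the chapter's sense:

library(caret)
# names of predictors with near-zero variance (degenerate distributions)
nearZeroVar(Soybean, names = TRUE)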

Part b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Yes.

  • The class 2-4-d-injury is missing values for most of the predictors.
  • phytophthora-rot is missing around 75% of the entries for about half of the predictors.
  • cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury are missing all observations for many predictors.
  • Any class not mentioned above is 100% complete.
Soybean_NA_Freq <- Soybean %>%
  pivot_longer(cols = -Class, names_to = "Measure", values_to = "Value") %>%
  filter(is.na(Value)) %>%
  count(Class, Measure)

Soybean_Total_by_class <- Soybean %>% count(Class)

names(Soybean_Total_by_class) <- c("Class", "Total_ct")

Soybean_NA_Freq <- inner_join(Soybean_NA_Freq, Soybean_Total_by_class)
## Joining, by = "Class"
Soybean_NA_Freq$missing_percent <- Soybean_NA_Freq$n / Soybean_NA_Freq$Total_ct

Soybean_NA_Freq <- Soybean_NA_Freq %>% select(Class, missing_percent, Measure)

Soybean_NA_Freq %>% pivot_wider(values_from = missing_percent, names_from = Class)
## # A tibble: 34 x 6
##    Measure   `2-4-d-injury` `cyst-nematode` `diaporthe-pod-&-s~ `herbicide-inju~
##    <chr>              <dbl>           <dbl>               <dbl>            <dbl>
##  1 area.dam          0.0625              NA                NA                 NA
##  2 canker.l~         1                    1                NA                  1
##  3 crop.hist         1                   NA                NA                 NA
##  4 date              0.0625              NA                NA                 NA
##  5 ext.decay         1                    1                NA                  1
##  6 fruit.po~         1                   NA                NA                 NA
##  7 fruit.sp~         1                    1                NA                  1
##  8 fruiting~         1                    1                NA                  1
##  9 germ              1                    1                 0.4                1
## 10 hail              1                    1                 1                  1
## # ... with 24 more rows, and 1 more variable: phytophthora-rot <dbl>

Part c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

In most of these instances, I would treat NA as its own factor level: the missing observations within a class may share a trait (the measurement was simply not taken), and the predictors are already categorical, so an explicit "missing" level fits naturally.
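
A minimal sketch of that strategy with base R's addNA(), which promotes NA to an explicit factor level without dropping any rows:

soybean_filled <- Soybean
# every predictor is already a factor, so NA can simply become its own level
soybean_filled[-1] <- lapply(soybean_filled[-1], addNA)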

HA 7.1

Consider the pigs series — the number of pigs slaughtered in Victoria each month.

Part a

Use the ses() function in R to find the optimal values of \(\alpha\) and \(\ell_0\), and generate forecasts for the next four months.

pm <- ses(pigs)
pm$model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = pigs) 
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665
forecast(pm)
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 1995       98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995       98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995       98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995       98816.41 83958.37 113674.4 76092.99 121539.8
## Jan 1996       98816.41 83448.52 114184.3 75313.25 122319.6
## Feb 1996       98816.41 82955.06 114677.8 74558.56 123074.2
## Mar 1996       98816.41 82476.49 115156.3 73826.66 123806.2
## Apr 1996       98816.41 82011.54 115621.3 73115.58 124517.2
## May 1996       98816.41 81559.12 116073.7 72423.66 125209.2
## Jun 1996       98816.41 81118.26 116514.6 71749.42 125883.4

Part b

Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the one produced by R.

They are close but not exact matches: the manual interval is (78680.0, 118952.8), while R's 95% interval for the first forecast is (78612.0, 119020.8), so R's is slightly wider.

pm$mean[1] - 1.96*sd(pm$residuals) 
## [1] 78679.97
pm$mean[1] + 1.96*sd(pm$residuals) 
## [1] 118952.8

HA 7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(\ell_0\). Do you get the same values as the ses() function?

Yet to finish
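
Since the hand-rolled SES function from HA 7.2 is not included in this write-up, here is a self-contained sketch of what the finished answer might look like; the function name ses_sse and the starting values passed to optim() are my own choices, not from the original:

# sum of squared one-step errors for SES with parameters alpha and l0
ses_sse <- function(par, y) {
  alpha <- par[1]
  level <- par[2]  # l0, the initial level
  sse <- 0
  for (t in seq_along(y)) {
    sse <- sse + (y[t] - level)^2                 # one-step-ahead forecast error
    level <- alpha * y[t] + (1 - alpha) * level   # SES level update
  }
  sse
}

opt <- optim(par = c(0.5, pigs[1]), fn = ses_sse, y = pigs)
opt$par  # compare with alpha = 0.2971 and l = 77260.06 from ses() in HA 7.1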

HA 8.1

Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Part a

Explain the differences among these figures. Do they all indicate that the data are white noise?

Since all of the autocorrelation spikes fall within the dashed lines marking the critical values, all three figures are consistent with white noise.

Part b

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The dashed lines are the critical values \(\pm 1.96/\sqrt{T}\), so they move toward zero as the number of observations \(T\) grows. With fewer observations the sample autocorrelations vary more around their true value of zero, which is why the smaller samples show larger (but still insignificant) spikes.
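
The shrinking bounds are easy to verify directly (a quick sketch, output not shown):

1.96 / sqrt(c(36, 360, 1000))  # approximately 0.327, 0.103, 0.062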

HA 8.2

A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

The time plot wanders with no fixed level, the ACF decays very slowly instead of dropping quickly to zero, and the PACF has a single large spike at lag 1: all signs of a non-stationary series that should be differenced.

ggtsdisplay(ibmclose)
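
As a follow-up sketch (not run above), taking a first difference removes the wandering level; the differenced series should show little remaining autocorrelation:

# first difference of the closing prices
ggtsdisplay(diff(ibmclose))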

HA 8.6

Yet to finish

HA 8.8

Yet to finish