library(fpp2)
library(seasonal)
library(tidyverse)
library(reshape2)
library(corrplot)
Use the help function to explore what the series gold, woolyrnq and gas represent.
help(gold)
help(woolyrnq)
help(gas)
Use autoplot() to plot each of these in separate plots.
autoplot(gold)
autoplot(woolyrnq)
autoplot(gas)
What is the frequency of each series? Hint: apply the frequency() function.
frequency(gold)
## [1] 1
frequency(woolyrnq)
## [1] 4
frequency(gas)
## [1] 12
Use which.max() to spot the outlier in the gold series. Which observation was it?
which.max(gold)
## [1] 770
Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.
You can read the data into R with the following script:
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
#summary(retaildata)
Select one of the time series as follows (but replace the column name with your own chosen column):
myts <- ts(retaildata[,"A3349788J"],
           frequency = 12, start = c(1982,4))
Explore your chosen retail time series using the following functions:
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts)
ggAcf(myts)
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
It is clearly seasonal and cyclical, peaking in December after a steady rise that begins in October and then dropping steeply in January.
The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
plastics
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 742 697 776 898 1030 1107 1165 1216 1208 1131 971 783
## 2 741 700 774 932 1099 1223 1290 1349 1341 1296 1066 901
## 3 896 793 885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4 951 861 938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013
autoplot(plastics)
Sales are significantly higher in the summer and early fall months (May to October), and the year-over-year increase visible in the table above points to an upward trend-cycle as well.
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
plastics %>%
  decompose(type = "multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of plastics")
Do the results support the graphical interpretation from part a? Yes: the seasonal component confirms the summer/early-fall peak and the trend-cycle confirms the steady rise across the five years.
Compute and plot the seasonally adjusted data.
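No code was included for this part; a minimal sketch, assuming the seasadj() helper from the forecast package (loaded via fpp2), would be:
fit <- decompose(plastics, type = "multiplicative")
# Divide out the seasonal component and overlay the adjusted series on the data
autoplot(plastics, series = "Data") +
  autolayer(seasadj(fit), series = "Seasonally adjusted") +
  xlab("Year") + ylab("Sales (thousands)")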
Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Does it make any difference if the outlier is near the end rather than in the middle of the time series?
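Again no code was included; one possible sketch (the choice of which observations to perturb is arbitrary) is:
plastics_mid <- plastics
plastics_mid[30] <- plastics_mid[30] + 500    # outlier near the middle of the series
plastics_end <- plastics
plastics_end[58] <- plastics_end[58] + 500    # outlier near the end

autoplot(seasadj(decompose(plastics, type = "multiplicative")), series = "No outlier") +
  autolayer(seasadj(decompose(plastics_mid, type = "multiplicative")), series = "Outlier in middle") +
  autolayer(seasadj(decompose(plastics_end, type = "multiplicative")), series = "Outlier near end")
In this sketch the middle outlier shows up as a spike in the adjusted series and also leaks into the seasonal indices (and hence every year's adjusted values), whereas the outlier near the end contributes little to the seasonal averaging because the centred moving-average trend is unavailable for the last few observations.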
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
# Reshape to long format and plot a density for every predictor
ggplot(data = melt(Glass), aes(x = value)) +
  stat_density() +
  facet_wrap(~variable, scales = "free")
## Using Type as id variables
corrplot.mixed(cor(Glass[,-10]))
Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, there are a great many outliers, most notably in Ba, Ca, and Fe. Several predictors are also skewed: Ba and Fe are mostly zero and heavily right-skewed, and K and Ca are right-skewed as well.
ggplot(data = melt(Glass), aes(x = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free")
## Using Type as id variables
Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes; the code below tries a square transformation for Ca and log transformations for Ba and Fe.
melt.glass <- melt(Glass)
## Using Type as id variables
# Ca: compare the original values ("Old") with a squared transformation ("New")
melt.glass.Ca.old <- melt.glass %>%
filter(variable == "Ca")
melt.glass.Ca.old$age <- "Old"
melt.glass.Ca.new <- melt.glass.Ca.old
melt.glass.Ca.new$age <- "New"
melt.glass.Ca.new$value <- melt.glass.Ca.old$value^2
melt.glass.Ca <- rbind(melt.glass.Ca.new, melt.glass.Ca.old)
ggplot(data = melt.glass.Ca, aes(x = value)) +
stat_density() +
facet_wrap(~age)
ggplot(data = melt.glass.Ca, aes(x = value)) +
geom_boxplot() +
facet_wrap(~age)
# Ba: compare the original values ("Old") with a log transformation ("New").
# Most Ba values are 0, so log(0) = -Inf; ggplot drops those non-finite
# values, which is what the warnings below report.
melt.glass.Ba.old <- melt.glass %>%
filter(variable == "Ba")
melt.glass.Ba.old$age <- "Old"
melt.glass.Ba.new <- melt.glass.Ba.old
melt.glass.Ba.new$age <- "New"
melt.glass.Ba.new$value <- log(melt.glass.Ba.old$value)
melt.glass.Ba <- rbind(melt.glass.Ba.new, melt.glass.Ba.old)
ggplot(data = melt.glass.Ba, aes(x = value)) +
stat_density() +
facet_wrap(~age)
## Warning: Removed 176 rows containing non-finite values (stat_density).
ggplot(data = melt.glass.Ba, aes(x = value)) +
geom_boxplot() +
facet_wrap(~age)
## Warning: Removed 176 rows containing non-finite values (stat_boxplot).
# Fe: same comparison with a log transformation; zero values again become -Inf
melt.glass.Fe.old <- melt.glass %>%
filter(variable == "Fe")
melt.glass.Fe.old$age <- "Old"
melt.glass.Fe.new <- melt.glass.Fe.old
melt.glass.Fe.new$age <- "New"
melt.glass.Fe.new$value <- log(melt.glass.Fe.old$value)
melt.glass.Fe <- rbind(melt.glass.Fe.new, melt.glass.Fe.old)
ggplot(data = melt.glass.Fe, aes(x = value)) +
stat_density() +
facet_wrap(~age)
## Warning: Removed 144 rows containing non-finite values (stat_density).
ggplot(data = melt.glass.Fe, aes(x = value)) +
geom_boxplot() +
facet_wrap(~age)
## Warning: Removed 144 rows containing non-finite values (stat_boxplot).
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. http://archive.ics.uci.edu/ml/index.html.
library(mlbench)
data(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Convert the ordered-factor predictors to plain (unordered) factors so that
# pivot_longer() below can combine them into a single column
for (i in 3:ncol(Soybean)){
  Soybean[,i] <- factor(Soybean[,i], ordered = FALSE)
}
Soybean %>%
  pivot_longer(cols = everything(), names_to = "Measure", values_to = "Value") %>%
  ggplot(aes(x = Value)) + geom_bar() + facet_wrap(~Measure, scales = "free")
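One programmatic way to check for degenerate distributions (assuming the caret package, which is not loaded above) is caret's near-zero-variance metrics; for this data it flags predictors such as mycelium and sclerotia, which are almost entirely a single value.
library(caret)
# Flag predictors whose distributions are dominated by a single value
nearZeroVar(Soybean, saveMetrics = TRUE)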
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Yes. The missing values are concentrated in a handful of classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot), and within those classes many predictors are missing for every observation, as the table below shows.
Soybean_NA_Freq <- Soybean%>%
pivot_longer(cols = -Class, names_to = "Measure", values_to = "Value") %>%
filter(is.na(Value)) %>%
count(Class, Measure)
Soybean_Total_by_class <- Soybean %>% count(Class)
names(Soybean_Total_by_class) <- c("Class", "Total_ct")
Soybean_NA_Freq <- inner_join(Soybean_NA_Freq,Soybean_Total_by_class )
## Joining, by = "Class"
Soybean_NA_Freq$missing_percent <- Soybean_NA_Freq$n / Soybean_NA_Freq$Total_ct
Soybean_NA_Freq <- Soybean_NA_Freq %>% select(Class, missing_percent, Measure)
Soybean_NA_Freq %>% pivot_wider(values_from = missing_percent, names_from = Class)
## # A tibble: 34 x 6
## Measure `2-4-d-injury` `cyst-nematode` `diaporthe-pod-&-s~ `herbicide-inju~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 area.dam 0.0625 NA NA NA
## 2 canker.l~ 1 1 NA 1
## 3 crop.hist 1 NA NA NA
## 4 date 0.0625 NA NA NA
## 5 ext.decay 1 1 NA 1
## 6 fruit.po~ 1 NA NA NA
## 7 fruit.sp~ 1 1 NA 1
## 8 fruiting~ 1 1 NA 1
## 9 germ 1 1 0.4 1
## 10 hail 1 1 1 1
## # ... with 24 more rows, and 1 more variable: phytophthora-rot <dbl>
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
In most of these instances I would treat NA as its own level: the predictors are already categorical, and the missingness itself is informative because it is concentrated in particular classes.
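A minimal sketch of that strategy, using base R's addNA() to make NA an explicit factor level (the Soybean_imputed name is just for illustration):
Soybean_imputed <- Soybean
# Recode NA as an explicit level in every predictor (all columns except Class)
Soybean_imputed[, -1] <- lapply(Soybean_imputed[, -1], addNA)
summary(Soybean_imputed$hail)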
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(\ell_0\). Do you get the same values as the ses() function?
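The function from the previous exercise is not reproduced here; a minimal sketch of the idea, applied to the pigs series used in the next exercise, might look like this:
# Hypothetical helper (not the original exercise function): returns the sum of
# squared one-step-ahead errors of simple exponential smoothing for a given
# alpha and initial level l0
ses_sse <- function(pars, y) {
  alpha <- pars[1]
  level <- pars[2]                      # l0
  sse <- 0
  for (t in seq_along(y)) {
    error <- y[t] - level               # one-step-ahead forecast error
    sse <- sse + error^2
    level <- alpha * y[t] + (1 - alpha) * level
  }
  sse
}

# Minimise the SSE over (alpha, l0) and compare with ses(pigs) below
opt <- optim(par = c(0.5, pigs[1]), fn = ses_sse, y = pigs)
opt$par
The optimised values are typically close to, though not exactly equal to, the \(\alpha = 0.2971\) and \(\ell_0 \approx 77260\) reported by ses() below.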
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
Use the ses() function in R to find the optimal values of \(\alpha\) and \(\ell_0\), and generate forecasts for the next four months.
pm <- ses(pigs)
pm$model
## Simple exponential smoothing
##
## Call:
## ses(y = pigs)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
forecast(pm)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
## Jan 1996 98816.41 83448.52 114184.3 75313.25 122319.6
## Feb 1996 98816.41 82955.06 114677.8 74558.56 123074.2
## Mar 1996 98816.41 82476.49 115156.3 73826.66 123806.2
## Apr 1996 98816.41 82011.54 115621.3 73115.58 124517.2
## May 1996 98816.41 81559.12 116073.7 72423.66 125209.2
## Jun 1996 98816.41 81118.26 116514.6 71749.42 125883.4
Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the one produced by R.
They are close but not exact matches.
pm$mean[1] - 1.96*sd(pm$residuals)
## [1] 78679.97
pm$mean[1] + 1.96*sd(pm$residuals)
## [1] 118952.8
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Explain the differences among these figures. Do they all indicate that the data are white noise?
Yes, all three indicate white noise: in each figure the autocorrelation spikes lie within the dashed significance bounds.
Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
The dashed critical-value lines are drawn at \(\pm 2/\sqrt{T}\), so they move closer to zero as the sample size \(T\) grows. With fewer observations the sample autocorrelations vary more widely around their true value of zero, so the bounds must be wider; with 1,000 observations the estimated autocorrelations sit much closer to zero and the bounds are correspondingly narrower.
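As a quick check, the bounds for the three sample sizes in Figure 8.31 are:
# +/- 2/sqrt(T) for T = 36, 360 and 1000 (roughly 0.33, 0.11 and 0.06)
2 / sqrt(c(36, 360, 1000))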
A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
The time plot wanders with no stable mean, the ACF decays very slowly from values near one instead of dropping quickly to zero, and the PACF has a single dominant spike near one at lag 1. All three patterns indicate a non-stationary series that should be differenced.
ggtsdisplay(ibmclose)
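As a quick confirmation (not part of the original answer), first-differencing removes most of the autocorrelation:
# After differencing, the series looks much closer to stationary
ggtsdisplay(diff(ibmclose))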
# HA 8.6