HA 2.1
Use the help function to explore what the series gold, woolyrnq and gas represent.
gold: Daily morning gold prices in US dollars. 1 January 1985 – 31 March 1989.
woolyrnq: Quarterly production of woollen yarn in Australia: tonnes. Mar. 1965 – Sep. 1994.
gas: Australian monthly gas production: 1956–1995.
### a
Use autoplot() to plot each of these in separate plots.
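A minimal sketch of the plotting calls, assuming the fpp2 package (which supplies these series and autoplot()) is loaded:
library(fpp2)

autoplot(gold)     # daily gold prices
autoplot(woolyrnq) # quarterly woollen yarn production
autoplot(gas)      # monthly gas production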
### b
What is the frequency of each series? Hint: apply the frequency() function.
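The output below was presumably produced by calls along these lines; the print() wrappers are my reconstruction:
print(paste0("Frequency of gold series: ", frequency(gold), ". Suggest: 365.25 for daily"))
print(paste0("Frequency of woolyrnq series: ", frequency(woolyrnq), ". Suggest: 4 for quarterly"))
print(paste0("Frequency of gas series: ", frequency(gas), ". Suggest: 12 for monthly"))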
## [1] "Frequency of gold series: 1. Suggest: 365.25 for daily"
## [1] "Frequency of woolyrnq series: 4. Suggest: 4 for quarterly"
## [1] "Frequency of gas series: 12. Suggest: 12 for monthly"
### c
Use which.max() to spot the outlier in the gold series. Which observation was it?
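A sketch of the lookup, assuming the same fpp2 session:
print(paste("The outlier in gold series appears in observation", which.max(gold)))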
## [1] "The outlier in gold series appears in observation 770"
HA 2.3
Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS Excel file.
### a
You can read the data into R with the following script:
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
The second argument (skip=1) is required because the Excel sheet has two header rows.
## # A tibble: 6 × 190
## `Series ID` A3349335T A3349627V A3349338X A3349398A A3349468W
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-04-01 00:00:00 303. 41.7 63.9 409. 65.8
## 2 1982-05-01 00:00:00 298. 43.1 64 405. 65.8
## 3 1982-06-01 00:00:00 298 40.3 62.7 401 62.3
## 4 1982-07-01 00:00:00 308. 40.9 65.6 414. 68.2
## 5 1982-08-01 00:00:00 299. 42.1 62.6 404. 66
## 6 1982-09-01 00:00:00 305. 42 64.4 412. 62.3
## # … with 184 more variables: A3349336V <dbl>, A3349337W <dbl>, A3349397X <dbl>,
## # A3349399C <dbl>, A3349874C <dbl>, A3349871W <dbl>, A3349790V <dbl>,
## # A3349556W <dbl>, A3349791W <dbl>, A3349401C <dbl>, A3349873A <dbl>,
## # A3349872X <dbl>, A3349709X <dbl>, A3349792X <dbl>, A3349789K <dbl>,
## # A3349555V <dbl>, A3349565X <dbl>, A3349414R <dbl>, A3349799R <dbl>,
## # A3349642T <dbl>, A3349413L <dbl>, A3349564W <dbl>, A3349416V <dbl>,
## # A3349643V <dbl>, A3349483V <dbl>, A3349722T <dbl>, A3349727C <dbl>, …
### b
Select one of the time series as follows (but replace the column name with your own chosen column):
myts <- ts(retaildata[,"A3349873A"],
           frequency=12, start=c(1982,4))
## Apr May Jun Jul Aug Sep
## 1982 91.8 102.6 105.0 106.0 96.9 97.5
### c
Explore your chosen retail time series using the following functions: autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf(). Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
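The plots discussed below come from calls like the following, using the myts series defined in part b:
autoplot(myts)
ggseasonplot(myts)
ggsubseriesplot(myts)
gglagplot(myts)
ggAcf(myts)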
autoplot: there is a clear increasing trend.
seasonalplot: retail sales fall steadily from January into early February, possibly because people tighten their budgets after the November and December spending of the previous year. Sales then rise slowly to a peak in June, which could reflect mid-year stocking up for outdoor activities; the gradual decline afterwards needs more investigation. From November to December, in each successive year people spend more than the year before, likely the effect of holidays, discounts and annual sales.
subseriesplot: the monthly averages are fairly stable from January to November, but rise in the last month of the year.
lagplot: the relationships in the lag plots are all positive, and lag 12 shows a much stronger relationship than the others.
autocorrelation: the correlations are all significant; the slight decrease as the lag increases is due to the trend, and the scalloped shape is due to the seasonality.
Conclusion:
- there is a seasonal pattern
- there is no clear cyclic pattern
- there is an increasing trend
HA 6.2
The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
### a
Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
The seasonal fluctuation increases from roughly the second month of the year to the eighth month, then decreases until about the end of the first month of the next year. There is also an increasing trend.
### b
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
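A sketch of the decomposition, assuming fpp2 (which provides the plastics series) is loaded:
plastics %>%
  decompose(type = "multiplicative") %>%
  autoplot()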
### c
Do the results support the graphical interpretation from part a?
Yes: the trend-cycle component rises steadily and the seasonal indices follow the within-year pattern described in part a.
### d
Compute and plot the seasonally adjusted data.
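A sketch using forecast's seasadj() helper; the object name fit is my own:
fit <- decompose(plastics, type = "multiplicative")
autoplot(seasadj(fit)) # seasonally adjusted = data / seasonal component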
### e
Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
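A sketch of the experiment; the choice of observation 30 is arbitrary:
plastics_out <- plastics
plastics_out[30] <- plastics_out[30] + 500 # inject the outlier mid-series
fit_out <- decompose(plastics_out, type = "multiplicative")
autoplot(seasadj(fit_out))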
The outlier shows up in the seasonally adjusted data. It slightly affects the points around it, but it does not seem to affect the points far away from it.
### f
Does it make any difference if the outlier is near the end rather than in the middle of the time series?
An outlier near the end only affects the data close to it, whereas an outlier in the middle seems to affect all of the data, whether close to it or further away.
KJ 3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
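The code chunk itself did not survive the export; the standard access route, as in the KJ text, is through the mlbench package:
library(mlbench)
data(Glass)
str(Glass)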
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
### a
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
From the distribution plots, Ba, Fe and K are skewed; Al, Ca, Na, RI and Si are roughly normal; Mg seems bimodal.
From the correlation plot, if we set the threshold at 0.8, only Ca and RI are strongly correlated. The remaining predictor variables do not seem to have strong relationships.
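The printed suggestion below is consistent with caret's findCorrelation() applied to the predictor correlation matrix; a sketch, with the 0.8 cutoff matching the threshold above:
library(caret)
high_cor <- findCorrelation(cor(Glass[, 1:9]), cutoff = 0.8)
print(paste("Suggested high correlated column for deletion is", high_cor))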
## [1] "Suggested high correlated column for deletion is 7"
### b
Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, the boxplots help to show the outliers. Predictor variables like Ba, Fe, K and Mg are skewed.
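A minimal sketch of the boxplots, assuming Glass is loaded as above:
par(mfrow = c(3, 3))
for (nm in names(Glass)[1:9]) {
  boxplot(Glass[[nm]], main = nm, horizontal = TRUE)
}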
### c
Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes. A log transformation suits the right-skewed variables, and a Box-Cox transformation can be applied across the variables to find an appropriate transformation for each.
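A sketch with caret's preProcess(). Note that Box-Cox requires strictly positive values, so for the zero-heavy predictors (Ba, Fe, K) the related Yeo-Johnson transformation is one option:
library(caret)
trans <- preProcess(Glass[, 1:9], method = "YeoJohnson") # handles zeros, unlike "BoxCox"
Glass_trans <- predict(trans, Glass[, 1:9])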
KJ 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
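As with the glass data, the chunk itself is missing; the standard route is via mlbench:
library(mlbench)
data(Soybean)
str(Soybean)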
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
### a
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
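The column numbers printed below look like the result of caret's nearZeroVar() on the predictors alone (with the Class column dropped, so indices are offset by one from the str() listing); a sketch:
library(caret)
nzv <- nearZeroVar(Soybean[, -1]) # near-zero-variance (degenerate) predictors
print(paste("the distributions degenerate in the way discussed in the chapter correspond to variables with column", nzv))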
## [1] "the distributions degenerate in the way discussed in the chapter correspond to variables with column 18"
## [2] "the distributions degenerate in the way discussed in the chapter correspond to variables with column 25"
## [3] "the distributions degenerate in the way discussed in the chapter correspond to variables with column 27"
### b
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
As the table below shows, the variables hail, lodging, seed.tmt and sever are more likely to be missing.
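A sketch that reproduces a table like the one below (the ordering among tied counts may differ):
library(dplyr)
library(tidyr)
Soybean %>%
  summarise(across(-Class, ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "var", values_to = "n_miss") %>%
  arrange(desc(n_miss))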
## # A tibble: 35 × 2
## var n_miss
## <chr> <int>
## 1 hail 121
## 2 lodging 121
## 3 seed.tmt 121
## 4 sever 121
## 5 germ 112
## 6 leaf.mild 108
## 7 fruit.spots 106
## 8 fruiting.bodies 106
## 9 seed.discolor 106
## 10 shriveling 106
## # … with 25 more rows
From the summary table, there are 5 main classes associated with the missing values.
## # A tibble: 5 × 1
## Class
## <fct>
## 1 2-4-d-injury
## 2 cyst-nematode
## 3 diaporthe-pod-&-stem-blight
## 4 herbicide-injury
## 5 phytophthora-rot
### c
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given that the data set contains only 683 observations, rows will be dropped only where a variable has a single missing value; for the rest of the data, I think it is better to impute, to preserve the integrity of the sample.
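A minimal sketch of one such strategy, mode imputation for the categorical predictors; the helper name impute_mode is my own, and model-based imputation (e.g., via the mice package) would preserve relationships between predictors better:
impute_mode <- function(x){
  x[is.na(x)] <- names(which.max(table(x))) # most frequent level
  x
}
Soybean_imp <- Soybean
Soybean_imp[, -1] <- lapply(Soybean_imp[, -1], impute_mode)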
HA 7.1
Consider the pigs series – the number of pigs slaughtered in Victoria each month.
### a
Use the ses() function in R to find the optimal values of \(\alpha\) and \(l_0\), and generate forecasts for the next four months.
pig_mod <- pigs %>% ses(h = 4)
pig_mod$model
## Simple exponential smoothing
##
## Call:
## ses(y = ., h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
### b
Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.
The interval calculated by hand is slightly narrower than the one produced by R.
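A sketch of the comparison, using the pig_mod object fitted above (the formatted print statements behind the output below are omitted):
s   <- sd(pig_mod$residuals)
fc1 <- pig_mod$mean[1]                               # first ses() forecast
c(fc1 - 1.96 * s, fc1 + 1.96 * s)                    # interval by hand
c(pig_mod$lower[1, "95%"], pig_mod$upper[1, "95%"])  # interval from R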
## [1] "the interval calculated by hand: (78679.967, 118952.845)"
## [1] "the interval produced by R: (78611.968, 119020.844)"
HA 7.2
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter \(\alpha\)) and level (the initial level \(l_0\)). It should return the forecast of the next observation in the series. Does it give the same forecasts as ses()?
Comparing the result from the function I created with the result the model produces: if I use the exact same parameters and time series, the results are the same.
ses_cast <- function(y, a, l){
  n <- length(y)
  y_hat <- 0
  # weighted-average form of SES: weight a*(1-a)^(j-1) on the j-th most recent observation
  for(a_pow in seq_len(n)){
    y_hat <- y_hat + a * (1-a)^(a_pow-1) * y[n - a_pow + 1]
  }
  # the remaining weight (1-a)^n falls on the initial level l0
  return(y_hat + (1-a)^n * l)
}
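A call along these lines, pulling the fitted parameters out of pig_mod from HA 7.1, should reproduce the comparison below:
alpha <- pig_mod$model$par["alpha"]
l0    <- pig_mod$model$par["l"]
print(paste("value calculated by created function is", ses_cast(pigs, a = alpha, l = l0)))
print(paste("value from the model is", pig_mod$mean[1]))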
## [1] "value calculated by created function is 98816.4061115907"
## [1] "value from the model is 98816.4061115907"
HA 7.3
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(l_0\). Do you get the same value as the ses() function?
The optimised \(\alpha\) is approximately 0.299 and the initial state \(l_0\) is approximately 76379.27. The ses() function gives \(\alpha = 0.2971\) and initial state \(l = 77260.0561\). They are not exactly the same, but pretty close.
sse_cast <- function(pars = c(a, l), y){
  alpha <- pars[1]
  l0    <- pars[2]
  n     <- length(y)
  df <- data.frame(time        = integer(n),
                   observation = numeric(n),
                   level       = numeric(n),
                   forecasts   = numeric(n))
  for(i in seq_len(n)){
    df[i, "time"]        <- i
    df[i, "observation"] <- y[i]
    df[i, "forecasts"]   <- l0                             # one-step forecast is the current level
    df[i, "level"]       <- alpha * y[i] + (1-alpha) * l0  # update the level with the new observation
    l0 <- df[i, "level"]                                   # carry the updated level forward
  }
  return(sum((df$observation - df$forecasts)^2))
}
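The optim() output below presumably came from a call like this; the starting values are my assumption:
optim(par = c(0.5, pigs[1]), fn = sse_cast, y = pigs)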
## $par
## [1] 2.990081e-01 7.637927e+04
##
## $value
## [1] 19767049111
##
## $counts
## function gradient
## 91 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
HA 8.1
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.
### a
Explain the differences among these figures. Do they all indicate that the data are white noise?
The autocorrelations in x1 are larger than in x2, and those in x2 are larger than in x3. Likewise, the critical-value bounds in x1 are wider than in x2, and those in x2 are wider than in x3. All three figures indicate white noise, because most of the spikes lie between the blue dashed lines.
### b
Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
The critical values sit at different distances from zero because, for white noise, the sample autocorrelations are expected to lie within \(\pm 2/\sqrt{T}\), where \(T\) is the length of the series; since x1, x2 and x3 have different lengths, the bounds differ. The autocorrelations differ across the figures because the sample sizes differ: white noise autocorrelations are random, and the larger the sample, the smaller the autocorrelations and the narrower the bounds, and vice versa.
HA 8.2
a classic example of non-stationary series is the daily
closing IBM stock price series (data set ibmclose). use R
to plot the daily closing prices for IBM stock and the ACF and PACF.
Explain how each plot shows that the series is non-stationary and should
be differenced.
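A compact way to get all three plots, assuming fpp2 (which provides ibmclose) is loaded:
library(fpp2)
ggtsdisplay(ibmclose) # time plot, ACF and PACF in one figure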
From the time plot, the IBM daily closing price appears to have a trend.
In the ACF, the autocorrelations are well above the critical bounds and decay slowly, which means the observations are not independent.
In the PACF, only the first lag is significantly different from zero, which means each point is correlated with the point right before it.
In conclusion, the series is non-stationary and should be differenced.
HA 8.6
Use R to simulate and plot some data from simple ARIMA models.
### a
Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_1 = 0\).
set.seed(100)
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1]+e[i]
### b
Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
The greater \(\phi_1\), the smoother the series is.
### c
Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
theta <- function(t){
  y <- ts(numeric(100)) # y[1] starts at 0
  e <- rnorm(100)       # sigma^2 = 1
  for(i in 2:100){
    y[i] <- e[i] + t*e[i-1]
  }
  return(y)
}
### d
Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
The greater the value of \(\theta_1\), the smoother the series is.
### e
Generate data from an ARMA(1,1) model with \(\phi_1 = -0.8\), \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
gen_arima <- function(phi, theta){
  y <- ts(numeric(100)) # y[1] starts at 0
  e <- rnorm(100)       # sigma^2 = 1
  for(i in 2:100)
    y[i] <- phi*y[i-1] + theta*e[i-1] + e[i]
  return(y)
}
### f
Generate data from an AR(2) model with \(\phi_1 = -0.8\), \(\phi_2 = 0.3\) and \(\sigma^2 = 1\). (Note that these parameters will give a non-stationary series.)
gen_ar <- function(phi1, phi2){
  y <- ts(numeric(100)) # y[1] and y[2] start at 0
  e <- rnorm(100)       # sigma^2 = 1
  for(i in 3:100)
    y[i] <- phi1*y[i-1] + phi2*y[i-2] + e[i]
  return(y)
}
### g
Graph the latter two series and compare them.
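A sketch of the comparison, reusing gen_arima() and gen_ar() from parts e and f; gridExtra is one way to stack the two plots:
library(gridExtra)
grid.arrange(
  autoplot(gen_arima(-0.8, 0.6)) + ggtitle("ARMA(1,1): phi = -0.8, theta = 0.6"),
  autoplot(gen_ar(-0.8, 0.3))    + ggtitle("AR(2): phi1 = -0.8, phi2 = 0.3"),
  nrow = 2
)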
Clearly from the graphs, the variance of the AR(2) series increases exponentially, oscillating with ever-growing amplitude, while the variance of the ARMA(1,1) series looks fairly constant; it almost resembles white noise.
HA 8.8
Consider austa, the total international visitors to Australia (in millions) for the period 1980–2015.
### a
Use auto.arima() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
The auto.arima() function selects an MA(1) model with one order of differencing, ARIMA(0,1,1) with drift. In the residual plot, all the lags lie between the blue dashed lines, which indicates that the residuals are white noise. Since the frequency is 1, this is a yearly series, so the 10 forecast periods are the next 10 years; the fitted model and the Ljung-Box test are shown below.
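The output below matches calls along these lines; the object name fit is my own:
fit <- auto.arima(austa)
fit                 # model summary shown below
checkresiduals(fit) # Ljung-Box output shown below
fit %>% forecast(h = 10) %>% autoplot()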
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 = 0.03376: log likelihood = 10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
##
## Model df: 2. Total lags used: 7
### b
Plot forecasts from an ARIMA(0,1,1) model with no drift and compare these to part a. Remove the MA term and plot again.
Comparing both plots (paying attention to the scale), the ARIMA(0,1,0) model has a narrower prediction interval.
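A sketch of the two fits being compared:
austa %>% Arima(order = c(0,1,1), include.drift = FALSE) %>%
  forecast(h = 10) %>% autoplot()
austa %>% Arima(order = c(0,1,0), include.drift = FALSE) %>%
  forecast(h = 10) %>% autoplot()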
### c
Plot forecasts from an ARIMA(2,1,3) model with drift. Remove the constant and see what happens.
There is no visible difference between the ARIMA(2,1,3) forecasts with and without the constant.
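A sketch of the two fits:
austa %>% Arima(order = c(2,1,3), include.drift = TRUE) %>%
  forecast(h = 10) %>% autoplot()
austa %>% Arima(order = c(2,1,3), include.constant = FALSE) %>%
  forecast(h = 10) %>% autoplot()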
### d
Plot forecasts from an ARIMA(0,0,1) model with a constant. Remove the MA term and plot again.
The ARIMA(0,0,1) forecasts decrease and converge to the mean of the past values after the first step, while the ARIMA(0,0,0) model puts equal weight on every past observation, so all of its forecasts are simply the mean.
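A sketch of the two fits:
austa %>% Arima(order = c(0,0,1), include.constant = TRUE) %>%
  forecast(h = 10) %>% autoplot()
austa %>% Arima(order = c(0,0,0), include.constant = TRUE) %>%
  forecast(h = 10) %>% autoplot()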
### e
Plot forecasts from an ARIMA(0,2,1) model with no constant.
The prediction interval widens as the forecast horizon increases.
Arima(austa, order = c(0,2,1), method = "ML", include.constant = FALSE) %>%
  forecast(h = 10) %>%
  autoplot()