About dataset

Our group will take the dataset about Milk production in lbs per cow per month Jan 1962 - Dec 1975. First of all, we download the dataset from website (https://www.york.ac.uk/depts/maths/data/ts/) and import it for this final project. We done that x is the Production per cow per month Jan 1962 - Dec 1975

We import these packages: install.packages(“tidyverse”) install.packages(“psych”) install.packages(“quantmod”) install.packages(“forecast”)

MilkProduction <- read.table("C:/Users/LAPTOP-LC/Downloads/MilkProduction.txt",header = TRUE)
attach(MilkProduction)
names(MilkProduction)
## [1] "Production"
x <- Production

Next, we will see the head and check the length of the dataset.

head(MilkProduction)
##   Production
## 1        589
## 2        561
## 3        640
## 4        656
## 5        727
## 6        697
length(x)
## [1] 168

We observe that this dataset has 168 records and it is fit with the time series for Month from Jan 1962 - Dec 1975.

library(psych)
## Warning: package 'psych' was built under R version 4.2.2
describe(x)
##    vars   n   mean    sd median trimmed    mad min max range skew kurtosis   se
## X1    1 168 754.71 102.2    761  754.45 108.23 553 969   416 0.01    -0.81 7.89
summary(MilkProduction)
##    Production   
##  Min.   :553.0  
##  1st Qu.:677.8  
##  Median :761.0  
##  Mean   :754.7  
##  3rd Qu.:824.5  
##  Max.   :969.0
sum(is.na(MilkProduction))
## [1] 0

Now, we see some infomations about this dataset. Because this dataset doesn’t contain NA value so that we don’t need to clean it. Next, we plot graph about this dataset for descriptive analysis.

boxplot(x, horizontal=FALSE, main="Box plot of Milk Production", ylab="Milk Production in lbs", col = "pink") 

It is clear to see that in the boxplot of Milk Production, it doesn’t contain any outliers. So next, we plot the graph to see the dataset.

reg = lm(x~seq(length(x)))
plot(seq(length(x)),x,col="pink",xlab="Time series",ylab="Milk Production per cow per month",lwd=2,pch=20,type="l")
abline(reg)

Looking on this graph, this data is non-stationary time series because mean is changed over the time. However, the standard deviation of this time series is quite constant.

Next, before applied ARIMA model, the time series have to be a stationary. We differencing the time series

ARIMA model

1.1 ARIMA without deseasonalize

Differece with lag = 1

diff1 = diff(x,lag=1)
plot(seq(167),diff1,type="l")

Test stationary time series with Augmented Dickey Fuller test

install.packages(“tseries”)

library(tseries)
## Warning: package 'tseries' was built under R version 4.2.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -9.5083, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
acf(diff1,lag.max=30,plot=TRUE)

=> q = 4

acf(diff1,lag.max=30,type="partial",plot=TRUE)

=> p = 1

In order to choose the optimize order = c(p,d,q), we using auto.arima function to find.

library(forecast)
auto.arima(x)
## Series: x 
## ARIMA(1,1,4) 
## 
## Coefficients:
##           ar1     ma1     ma2      ma3      ma4
##       -0.3045  0.2456  0.1500  -0.4257  -0.6493
## s.e.   0.1158  0.0816  0.0545   0.0486   0.0614
## 
## sigma^2 = 1380:  log likelihood = -839.88
## AIC=1691.77   AICc=1692.29   BIC=1710.48
A = auto.arima(x)
A= as.data.frame(A[7])
p= A[1,1]
q= A[2,1]
P= A[3,1]
Q= A[4,1]
d= A[6,1]
D= A[7,1]
arima <- arima(x,order=c(p,d,q))
arima
## 
## Call:
## arima(x = x, order = c(p, d, q))
## 
## Coefficients:
##           ar1     ma1     ma2      ma3      ma4
##       -0.3045  0.2456  0.1500  -0.4257  -0.6493
## s.e.   0.1158  0.0816  0.0545   0.0486   0.0614
## 
## sigma^2 estimated as 1339:  log likelihood = -839.88,  aic = 1691.77
sarima <- arima(x,order=c(1,1,4),seasonal = list(order = c(0, 1, 1), period = 12))
sarima
## 
## Call:
## arima(x = x, order = c(1, 1, 4), seasonal = list(order = c(0, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     ma2     ma3      ma4     sma1
##       0.4086  -0.6411  0.1304  0.1264  -0.1974  -0.6176
## s.e.  0.3885   0.3851  0.1366  0.1074   0.0933   0.0644
## 
## sigma^2 estimated as 50.94:  log likelihood = -527.57,  aic = 1069.13

Evaluate model

library(forecast)
checkresiduals(sarima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,4)(0,1,1)[12]
## Q* = 2.0231, df = 4, p-value = 0.7315
## 
## Model df: 6.   Total lags used: 10

The residuals of SARIMA model is suitable and model is good enough to predict.

1.2 ARIMA with seasonalized

Differencing the time series at the seasonal lag

We set lag = 12 because time series follow by month

diff12 <- diff(x,lag=12)
plot(diff12,main="Deseasonal time series",type="l")

It seems to be a non-stationary time series. Our group are going to do the adf test to check the stationary.

library(tseries)
adf.test(diff12)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff12
## Dickey-Fuller = -2.8615, Lag order = 5, p-value = 0.2172
## alternative hypothesis: stationary

The diff12 time series is non-stationary so we are going to make it stationary by differencing with lag = 1.

Difference time series after deseasonalize.
y = diff(diff12,lag=1)
plot(seq(155),y,type="l")

Augmented Dickey-Fuller Test
library(tseries)
adf.test(y)
## Warning in adf.test(y): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  y
## Dickey-Fuller = -5.8483, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

p-value = 0.01 => stationary time series After differencing with lag = 1 for deseasonalize time series, we got stationary time series y.

Plot ACF and PACF

acf(y,lag.max=30,plot=TRUE)

acf(y,lag.max=10,type="partial",plot=TRUE)

In order to choose the optimize order = c(p,d,q), we using auto.arima function to find.

library(forecast)
auto.arima(diff12)
## Series: diff12 
## ARIMA(2,1,2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2
##       -0.3418  -0.8660  0.1841  0.9564
## s.e.   0.0538   0.0542  0.0334  0.0536
## 
## sigma^2 = 69.59:  log likelihood = -547.69
## AIC=1105.37   AICc=1105.78   BIC=1120.59
B = auto.arima(diff12)
B= as.data.frame(B[7])
p1= B[1,1]
q1= B[2,1]
P1= B[3,1]
Q1= B[4,1]
d1= B[6,1]
D1= B[7,1]
pred_y = arima(diff12,order=c(p1,d1,q1))

Next, we check residual of pred_y model.

library(forecast)
checkresiduals(pred_y)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 7.8527, df = 6, p-value = 0.2491
## 
## Model df: 4.   Total lags used: 10

Residuals of pred_y model is also suitable and model is good enough.

Compare 2 model

SARIMA model has sigma^2 is 50.94 and aic is 1069.13 pred_y model has sigma^2 is 69.59 and aic is 1105.37

By comparing 2 model sarima and pred_y, we decide to choose sarima model which is better than with lower sigma^2 and lower aic

Forecasting

autoplot(forecast(sarima))