Our group will take the dataset about Milk production in lbs per cow per month Jan 1962 - Dec 1975. First of all, we download the dataset from website (https://www.york.ac.uk/depts/maths/data/ts/) and import it for this final project. We done that x is the Production per cow per month Jan 1962 - Dec 1975
We import these packages: install.packages(“tidyverse”) install.packages(“psych”) install.packages(“quantmod”) install.packages(“forecast”)
MilkProduction <- read.table("C:/Users/LAPTOP-LC/Downloads/MilkProduction.txt",header = TRUE)
attach(MilkProduction)
names(MilkProduction)
## [1] "Production"
x <- Production
Next, we will see the head and check the length of the dataset.
head(MilkProduction)
## Production
## 1 589
## 2 561
## 3 640
## 4 656
## 5 727
## 6 697
length(x)
## [1] 168
We observe that this dataset has 168 records and it is fit with the time series for Month from Jan 1962 - Dec 1975.
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
describe(x)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 168 754.71 102.2 761 754.45 108.23 553 969 416 0.01 -0.81 7.89
summary(MilkProduction)
## Production
## Min. :553.0
## 1st Qu.:677.8
## Median :761.0
## Mean :754.7
## 3rd Qu.:824.5
## Max. :969.0
sum(is.na(MilkProduction))
## [1] 0
Now, we see some infomations about this dataset. Because this dataset doesn’t contain NA value so that we don’t need to clean it. Next, we plot graph about this dataset for descriptive analysis.
boxplot(x, horizontal=FALSE, main="Box plot of Milk Production", ylab="Milk Production in lbs", col = "pink")
It is clear to see that in the boxplot of Milk
Production, it doesn’t contain any outliers. So next, we plot
the graph to see the dataset.
reg = lm(x~seq(length(x)))
plot(seq(length(x)),x,col="pink",xlab="Time series",ylab="Milk Production per cow per month",lwd=2,pch=20,type="l")
abline(reg)
Looking on this graph, this data is non-stationary time series because mean is changed over the time. However, the standard deviation of this time series is quite constant.
Next, before applied ARIMA model, the time series have to be a stationary. We differencing the time series
diff1 = diff(x,lag=1)
plot(seq(167),diff1,type="l")
install.packages(“tseries”)
library(tseries)
## Warning: package 'tseries' was built under R version 4.2.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -9.5083, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
acf(diff1,lag.max=30,plot=TRUE)
=> q = 4
acf(diff1,lag.max=30,type="partial",plot=TRUE)
=> p = 1
In order to choose the optimize order = c(p,d,q), we using auto.arima function to find.
library(forecast)
auto.arima(x)
## Series: x
## ARIMA(1,1,4)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4
## -0.3045 0.2456 0.1500 -0.4257 -0.6493
## s.e. 0.1158 0.0816 0.0545 0.0486 0.0614
##
## sigma^2 = 1380: log likelihood = -839.88
## AIC=1691.77 AICc=1692.29 BIC=1710.48
A = auto.arima(x)
A= as.data.frame(A[7])
p= A[1,1]
q= A[2,1]
P= A[3,1]
Q= A[4,1]
d= A[6,1]
D= A[7,1]
arima <- arima(x,order=c(p,d,q))
arima
##
## Call:
## arima(x = x, order = c(p, d, q))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4
## -0.3045 0.2456 0.1500 -0.4257 -0.6493
## s.e. 0.1158 0.0816 0.0545 0.0486 0.0614
##
## sigma^2 estimated as 1339: log likelihood = -839.88, aic = 1691.77
sarima <- arima(x,order=c(1,1,4),seasonal = list(order = c(0, 1, 1), period = 12))
sarima
##
## Call:
## arima(x = x, order = c(1, 1, 4), seasonal = list(order = c(0, 1, 1), period = 12))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## 0.4086 -0.6411 0.1304 0.1264 -0.1974 -0.6176
## s.e. 0.3885 0.3851 0.1366 0.1074 0.0933 0.0644
##
## sigma^2 estimated as 50.94: log likelihood = -527.57, aic = 1069.13
library(forecast)
checkresiduals(sarima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,4)(0,1,1)[12]
## Q* = 2.0231, df = 4, p-value = 0.7315
##
## Model df: 6. Total lags used: 10
The residuals of SARIMA model is suitable and model is good enough to predict.
We set lag = 12 because time series follow by month
diff12 <- diff(x,lag=12)
plot(diff12,main="Deseasonal time series",type="l")
It seems to be a non-stationary time series. Our group are going to do the adf test to check the stationary.
library(tseries)
adf.test(diff12)
##
## Augmented Dickey-Fuller Test
##
## data: diff12
## Dickey-Fuller = -2.8615, Lag order = 5, p-value = 0.2172
## alternative hypothesis: stationary
The diff12 time series is non-stationary so we are going to make it stationary by differencing with lag = 1.
y = diff(diff12,lag=1)
plot(seq(155),y,type="l")
library(tseries)
adf.test(y)
## Warning in adf.test(y): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: y
## Dickey-Fuller = -5.8483, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
p-value = 0.01 => stationary time series After differencing with lag = 1 for deseasonalize time series, we got stationary time series y.
acf(y,lag.max=30,plot=TRUE)
acf(y,lag.max=10,type="partial",plot=TRUE)
In order to choose the optimize order = c(p,d,q), we using auto.arima function to find.
library(forecast)
auto.arima(diff12)
## Series: diff12
## ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## -0.3418 -0.8660 0.1841 0.9564
## s.e. 0.0538 0.0542 0.0334 0.0536
##
## sigma^2 = 69.59: log likelihood = -547.69
## AIC=1105.37 AICc=1105.78 BIC=1120.59
B = auto.arima(diff12)
B= as.data.frame(B[7])
p1= B[1,1]
q1= B[2,1]
P1= B[3,1]
Q1= B[4,1]
d1= B[6,1]
D1= B[7,1]
pred_y = arima(diff12,order=c(p1,d1,q1))
Next, we check residual of pred_y model.
library(forecast)
checkresiduals(pred_y)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 7.8527, df = 6, p-value = 0.2491
##
## Model df: 4. Total lags used: 10
Residuals of pred_y model is also suitable and model is good enough.
SARIMA model has sigma^2 is 50.94 and aic is 1069.13 pred_y model has sigma^2 is 69.59 and aic is 1105.37
By comparing 2 model sarima and pred_y, we decide to choose sarima model which is better than with lower sigma^2 and lower aic
autoplot(forecast(sarima))