Our group will take the dataset about Car sales in Quebec (thousands) Jan 1960 - Dec 1968. First of all, we download the dataset from website (https://www.york.ac.uk/depts/maths/data/ts/) and import it for this final project.
Cars<-read.csv("E:/dataR/carsale.csv")
attach(Cars)
names(Cars)
## [1] "Year" "Month" "Sales"
length(Sales)
## [1] 108
The column Sales shows the total number of cars sold. Dataset consists of the records from Jan 1960 to Nov 1968. There are total 108 observations.
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
describe(Sales)
## vars n mean sd median trimmed mad min max range skew
## X1 1 108 14511.78 4690.21 14076 14456.4 4612.37 1015 26099 25084 0.12
## kurtosis se
## X1 -0.32 451.32
summary(Cars)
## Year Month Sales
## Min. :1960 Min. : 1.00 Min. : 1015
## 1st Qu.:1962 1st Qu.: 3.75 1st Qu.:11391
## Median :1964 Median : 6.50 Median :14076
## Mean :1964 Mean : 6.50 Mean :14512
## 3rd Qu.:1966 3rd Qu.: 9.25 3rd Qu.:17596
## Max. :1968 Max. :12.00 Max. :26099
sum(is.na(Cars))
## [1] 0
Here are some statistics related to this dataset.
We don’t need to clean this dataset because it doesn’t contain NA values. We then create a graph for descriptive analysis of this dataset.
boxplot(Sales, horizontal=FALSE, main="Box plot of Cars", ylab="Car sales", col = "red")
As we observe non stationarity due to upwards trend, we then plot ACF & PACF respectively.
cars.ts <- matrix(Sales,nrow=108,ncol=1)
cars.ts<- as.vector(t(cars.ts))
cars.ts <- ts(cars.ts,start=c(1960,1), end=c(1968,12), frequency=12)
plot(cars.ts,type='o',ylab='Car Sales')
ACF plot for the data and PACF plot of the data
par(mfrow=c(1,2))
acf(cars.ts,lag.max = 36)
pacf(cars.ts, lag.max = 36)
The seasonal ARIMA model is :
\[ARIMA(p, d, q)\times (P, D, Q)s\]
with:
p = non-seasonal AR order
d = non-seasonal differencing
q = non-seasonal MA order
P = seasonal AR order
D = seasonal differencing
Q = seasonal MA order and s is time span of repeating seasonal pattern.
Specification of seasonal part (D=1)
Firstly, we do seasonal differencing to get rid of seasonal trend and fitting a plain model until time series and ACF/PACF plots of residuals show no sign of seasonality.
Then we determine the orders of P & Q of the seasonal part based on final ACF/PACF plots of residuals.
We do so by fitting the ARIMA(0, 0, 0) x (0, 1, 0) model and plotting the graphs.
Although the general upward trend is resolved, we plot the ACF & PACF.
m1.cars = arima(cars.ts,order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res.m1 = residuals(m1.cars);
par(mfrow=c(1,1))
plot(res.m1,xlab='Time',ylab='Residuals', col="blue", main="Time series plot of 1st seasonal difference")
ACF & PACF plot
The ACF/PACF plots show seasonal trend is filtered out.
ACF & PACF shows 1 seasonal lag , indicates SARMA(1,1)
ACF/PACF plot of the residuals
par(mfrow=c(1,2))
acf(res.m1, lag.max = 36)
pacf(res.m1, lag.max = 36)
Specification of seasonal part (D=1, P=1, Q=1)
The general upward trend is no longer seen. We plot the residuals ACF & PACF plots.
Time series plot with seasonal AR & MA coefficient (P=1,Q=1)
m2.cars = arima(cars.ts,order=c(0,0,0),seasonal=list(order=c(1,1,1), period=12))
res.m2 = residuals(m2.cars);
par(mfrow=c(1,1))
plot(res.m2,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")
ACF & PACF plots
On the seasonal lags, autocorrelation is still present. In order to remove seasonality, the process must be repeated with a higher number of Q.
ACF/PACF plot of the residuals with AR & MA coefficient (P=1,D=1,Q=1)
par(mfrow=c(1,2))
acf(res.m2, lag.max = 36)
pacf(res.m2, lag.max = 36)
Specification of seasonal part (D=1, P=1, Q=2)
Seasonal lags are no longer an issue. Thus, at P=1, D=1, and Q=2, the seasonality specification is complete. We can also notice that the regular pane has a number of lags.
We apply transformation first and look for any substantial lags because we don’t notice any trend.
ACF/PACF plot of the residuals with P=1,D=1,Q=2
m3.cars = arima(cars.ts,order=c(0,0,0),seasonal=list(order=c(1,1,2), period=12))
res.m3 = residuals(m3.cars)
par(mfrow=c(1,2))
acf(res.m3, lag.max = 36)
pacf(res.m3, lag.max = 36)
Data transformation are important tools for proper statistical analysis.
The time series data was subjected to a log transformation before being graphed.
Time series plot with transformed data
log.cars.ts = log(cars.ts)
par(mfrow=c(1,1))
plot(log.cars.ts,ylab='log of sales count',xlab='Year',type='o', col="dark green")
We fit the model with log transformed data and draw conclusions from sample ACF & PACF’s Time series plot of the residuals after transformation
m4.cars = arima(log.cars.ts,order=c(0,0,0),seasonal=list(order=c(1,1,2), period=12))
res.m4 = residuals(m4.cars)
plot(res.m4,xlab='Time',ylab='Residuals',main="Time series plot of the residuals", col="red")
The ACF plot shows two significant lags before 1st seasonal lag. PACF shows 2 significant lags.
ACF plot of the residuals after transformation and PACF plot of the residuals after transformation
par(mfrow=c(1,2))
acf(res.m4, lag.max = 36)
pacf(res.m4, lag.max = 36)
Non-seasonal Differencing
The creation of a list of potential models is the primary goal here. Therefore, let’s begin with d=1.
We begin with simple differencing with d=1 to remove any lingering trend and correlation in ACF/PACF plots because we don’t observe any decaying pattern.
We gain something from differentiation. It aids in the development of a model.
ACF/PACF plot of the residuals with ordinary differencing
# SARIMA(0,1,0)x(1,1,2)
m5.cars = arima(log.cars.ts,order=c(0,1,0),seasonal=list(order=c(1,1,2), period=12))
res.m5 = residuals(m5.cars)
par(mfrow=c(1,2))
acf(res.m5, lag.max = 36)
pacf(res.m5, lag.max = 36)
At the first lag, we obtain a high correlation and also note a small number of significant lags. The intervention point is to blame for everything.
We come up with MA(3 or 4)and AR(3) from ACF & PACF plots.
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(res.m5)
## Warning in adf.test(res.m5): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: res.m5
## Dickey-Fuller = -7.9982, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
EACF
To see information about the AR and MA parts that are still present in the residuals from the previous step, or res.m5, we will now use EACF. Our ARMA part candidates are ARMA(1,2), ARMA(2,2)& ARMA(2,1)
#eacf(res.m5)
AR/MA
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x o o o o o o o o o o o o o
1 x x o o o o o o o o o o o o
2 x o o o o o o o o o o o o o
3 x o o x o o o o o o o x o o
4 x o o x o o o o o o o o o o
5 x x x x o o o o o o o o o o
6 x o x o x o o o o o o o o o
7 x o x o x o o o o o o o o o
The tentative models are specified as:
• SARIMA (0,1,4)x(1,1,2) by ACF if we consider pacf is tailing off so we take only MA coefficient.
• SARIMA (0,1,3)x(1,1,2) by ACF if we consider pacf is tailing off so we take only MA coefficient.
• SARIMA (3,1,4)x(1,1,2) by ACF/PACF
• SARIMA (2,1,1)x(1,1,2) by EACF
• SARIMA (2,1,2)x(1,1,2) by EACF
• SARIMA (3,1,2)x(1,1,2) for over fitting SARIMA (2,1,2)x(1,1,2)
1.SARIMA (0,1,3)x(1,1,2)
model2.cars = arima(log.cars.ts,order=c(0,1,3),seasonal=list(order=c(1,1,2),period=12),method ="ML")
#coeftest(model2.cars)
ACF/PACF exhibits the presence of white noise, making it one of the better models. We then fit additional models and compare results.
ACF/PACF plot of the residuals for ARIMA(0,1,3)x(1,1,2)
res.model2 = residuals(model2.cars)
par(mfrow=c(1,2))
acf(res.model2, lag.max = 36)
pacf(res.model2, lag.max = 36)
SARIMA(0,1,4)x(1,1,2)
model1.cars = arima(log.cars.ts,order=c(0,1,4),seasonal=list(order=c(1,1,2), period=12),method = "ML")
#coeftest(model1.cars)
According to the coefficient test, MA(2) and MA(4) are not significant, while MA(3) is marginally significant.
Thus, we are unable to use this model for additional analysis.
Also in PACF, we see some slightly significant lag.
ACF/PACF plot of the residuals for ARIMA(0,1,4)x(1,1,2)
res.model1 = residuals(model1.cars);
par(mfrow=c(1,2))
acf(res.model1, lag.max = 36)
pacf(res.model1, lag.max = 36)
SARIMA(3,1,4)x(1,1,2)
We go with the bigger model from ACF/PACF
model3.cars = arima(log.cars.ts,order=c(3,1,4),seasonal=list(order=c(1,1,2),period=12),method = "ML")
#coeftest(model3.cars)
The coefficient test does not reveal any significant results, but we can see that the residuals’ white noise is present.
So, we later determine whether or not this model is appropriate.
ACF/PACF plot of the residuals for ARIMA(3,1,4)x(1,1,2)
res.model3 = residuals(model3.cars)
par(mfrow=c(1,2))
acf(res.model3, lag.max = 36)
pacf(res.model3, lag.max = 36)
SARIMA(2,1,1)x(1,1,2)
model4.cars = arima(log.cars.ts,order=c(2,1,1),seasonal=list(order=c(1,1,2), period=12),method = "ML")
#coeftest(model4.cars)
MA(1) coefficient shows significant values but AR component doesn’t. Father ACF/PACF plots shows slightly significant lags.
ACF/PACF plot of the residuals for ARIMA(2,1,1)x(1,1,2)
res.model4 = residuals(model4.cars)
par(mfrow=c(1,2))
acf(res.model4, lag.max = 36)
pacf(res.model4, lag.max = 36)
SARIMA(2,1,2)x(1,1,2)
model5.cars = arima(log.cars.ts,order=c(2,1,2),seasonal=list(order=c(1,1,2),
period=12),method = "ML")
#coeftest(model5.cars)
Only AR(2) is significant. ACF/PACF shows presence of white noise
ACF/PACF plot of the residuals for ARIMA(2,1,2)x(1,1,2)
res.model5 = residuals(model5.cars)
par(mfrow=c(1,2))
acf(res.model5, lag.max = 36)
pacf(res.model5, lag.max = 36)
AIC & BIC
We computed the coefficient test and fitted every possible model in the list above. We can use the AIC/BIC function to choose the best model.
sort.score <- function(x, score = c("bic", "aic")){
if (score == "aic"){
x[with(x, order(AIC)),]
} else if (score == "bic") {
x[with(x, order(BIC)),]
} else {
warning('score = "x" only accepts valid arguments ("aic","bic")')
}
}
sc.AIC=AIC(model1.cars,model2.cars,model3.cars,model4.cars,model5.cars)
sc.BIC=AIC(model1.cars,model2.cars,model3.cars,model4.cars,model5.cars,k=log(length(cars.ts)))
AIC gives us model2 i.e. SARIMA(0,1,3)x(1,1,2) as best model.
sort.score(sc.AIC, score = "aic")
## df AIC
## model4.cars 7 53.84486
## model2.cars 7 53.84549
## model1.cars 8 55.44509
## model5.cars 8 55.64937
## model3.cars 11 59.11388
BIC also gives us the same model2.cars i.e. SARIMA(0,1,3)x(1,1,2) as best one.
sort.score(sc.BIC, score = "aic")
## df AIC
## model4.cars 7 72.61978
## model2.cars 7 72.62041
## model1.cars 8 76.90214
## model5.cars 8 77.10642
## model3.cars 11 88.61732
We always opt for a model whose residuals contain white noise, so we always choose one with residuals like that.
The specified model is further examined using residual analysis.
Model Diagnostics
We will use SARIMA(0,1,3)x(1,1,2) for parameter estimation, parameter significance testing, and forecasting.
Additionally, we must verify that the residuals are normal (on top of being presence of white noise)
residual.analysis <- function(model, std = TRUE){
library(TSA)
#library(FitAR)
if (std == TRUE){
res.model = rstandard(model)
}else{
res.model = residuals(model)
}
par(mfrow=c(2,2))
plot(res.model,type='o',ylab='Standardized residuals', main="Time series plot of standardized residuals")
abline(h=0)
hist(res.model,main="Histogram of standardized residuals")
qqnorm(res.model,main="QQ plot of standardized residuals")
qqline(res.model, col = 2)
acf(res.model,main="ACF of standardized residuals")
print(shapiro.test(res.model))
}
residual.analysis(model=model2.cars)
## Warning: package 'TSA' was built under R version 4.2.2
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.67961, p-value = 5.033e-14
ARIMA(0,1,3)x(1,1,2)
From the residual analysis of SARIMA (0,1,3)x(1,1,2) model we draw the following conclusions:
residual.analysis(model=model4.cars)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.681, p-value = 5.416e-14
ARIMA(2,1,1)x(1,1,2)
From the residual analysis of SARIMA (2,1,1)x(1,1,2) model we draw the following conclusions:
residual.analysis(model=model1.cars)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.68788, p-value = 7.812e-14
ARIMA(2,1,2)x(1,1,2) From the residual analysis of SARIMA (2,1,2)x(1,1,2) model we draw the following conclusions:
residual.analysis(model=model5.cars)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.68436, p-value = 6.474e-14
Forecasting
After selecting the top model, ARIMA (0,1,3)x(1,1,2), we will take it into consideration for time series prediction going forward. Forecasts for the model that we fitted are shown in the image below with a lead period of 10 years.
The name of the data series, the number of times we want a forecast (10 years in our example), and then the parameters of the ARIMA model are listed in the command’s sequence of parameters.
plot(log.cars.ts)
Using the model we determined to be the best out of a range of feasible models, we have ultimately predicted automobile sales for the upcoming ten years. The estimate indicates an upward tendency, which is plausible given that car sales may increase annually due to technological improvement, population expansion, and other factors.