Our group will take the dataset about Car sales in Quebec (thousands) Jan 1960 - Dec 1968. First of all, we download the dataset from website (https://www.york.ac.uk/depts/maths/data/ts/) and import it for this final project.

Insight of the dataset:

Cars<-read.csv("E:/dataR/carsale.csv")
attach(Cars)
names(Cars)

## [1] "Year"  "Month" "Sales"

length(Sales)

## [1] 108

The column Sales shows the total number of cars sold. Dataset consists of the records from Jan 1960 to Nov 1968. There are total 108 observations.

library(psych)

## Warning: package 'psych' was built under R version 4.2.2

describe(Sales)

##    vars   n     mean      sd median trimmed     mad  min   max range skew
## X1    1 108 14511.78 4690.21  14076 14456.4 4612.37 1015 26099 25084 0.12
##    kurtosis     se
## X1    -0.32 451.32

summary(Cars)

##       Year          Month           Sales      
##  Min.   :1960   Min.   : 1.00   Min.   : 1015  
##  1st Qu.:1962   1st Qu.: 3.75   1st Qu.:11391  
##  Median :1964   Median : 6.50   Median :14076  
##  Mean   :1964   Mean   : 6.50   Mean   :14512  
##  3rd Qu.:1966   3rd Qu.: 9.25   3rd Qu.:17596  
##  Max.   :1968   Max.   :12.00   Max.   :26099

sum(is.na(Cars))

## [1] 0

Here are some statistics related to this dataset.

Methodology, Analysis, and results

We don’t need to clean this dataset because it doesn’t contain NA values. We then create a graph for descriptive analysis of this dataset.

boxplot(Sales, horizontal=FALSE, main="Box plot of Cars", ylab="Car sales", col = "red")

There are no outliers in the data on car sales. So that we can observe the dataset, we then plot the graph.

As we observe non stationarity due to upwards trend, we then plot ACF & PACF respectively.

cars.ts <- matrix(Sales,nrow=108,ncol=1)
cars.ts<- as.vector(t(cars.ts))
cars.ts <- ts(cars.ts,start=c(1960,1), end=c(1968,12), frequency=12)
plot(cars.ts,type='o',ylab='Car Sales')

ACF plot for the data and PACF plot of the data

par(mfrow=c(1,2))
acf(cars.ts,lag.max = 36)
pacf(cars.ts, lag.max = 36)

Seasonality:

The seasonal ARIMA model is :

\[ARIMA(p, d, q)\times (P, D, Q)s\]

with:

p = non-seasonal AR order
d = non-seasonal differencing
q = non-seasonal MA order
P = seasonal AR order
D = seasonal differencing
Q = seasonal MA order and s is time span of repeating seasonal pattern.

Specification of seasonal part (D=1)

Firstly, we do seasonal differencing to get rid of seasonal trend and fitting a plain model until time series and ACF/PACF plots of residuals show no sign of seasonality.
Then we determine the orders of P & Q of the seasonal part based on final ACF/PACF plots of residuals.
We do so by fitting the ARIMA(0, 0, 0) x (0, 1, 0) model and plotting the graphs.
Although the general upward trend is resolved, we plot the ACF & PACF.

m1.cars = arima(cars.ts,order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res.m1 = residuals(m1.cars);
par(mfrow=c(1,1))
plot(res.m1,xlab='Time',ylab='Residuals', col="blue", main="Time series plot of 1st seasonal difference")

ACF & PACF plot

The ACF/PACF plots show seasonal trend is filtered out.
ACF & PACF shows 1 seasonal lag , indicates SARMA(1,1)

ACF/PACF plot of the residuals

par(mfrow=c(1,2))
acf(res.m1, lag.max = 36)
pacf(res.m1, lag.max = 36)

Specification of seasonal part (D=1, P=1, Q=1)

The general upward trend is no longer seen. We plot the residuals ACF & PACF plots.

Time series plot with seasonal AR & MA coefficient (P=1,Q=1)

m2.cars = arima(cars.ts,order=c(0,0,0),seasonal=list(order=c(1,1,1), period=12))
res.m2 = residuals(m2.cars);
par(mfrow=c(1,1))
plot(res.m2,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")

ACF & PACF plots

On the seasonal lags, autocorrelation is still present. In order to remove seasonality, the process must be repeated with a higher number of Q.

ACF/PACF plot of the residuals with AR & MA coefficient (P=1,D=1,Q=1)

par(mfrow=c(1,2))
acf(res.m2, lag.max = 36)
pacf(res.m2, lag.max = 36)

Specification of seasonal part (D=1, P=1, Q=2)

Seasonal lags are no longer an issue. Thus, at P=1, D=1, and Q=2, the seasonality specification is complete. We can also notice that the regular pane has a number of lags.

We apply transformation first and look for any substantial lags because we don’t notice any trend.

ACF/PACF plot of the residuals with P=1,D=1,Q=2

m3.cars = arima(cars.ts,order=c(0,0,0),seasonal=list(order=c(1,1,2), period=12))
res.m3 = residuals(m3.cars)
par(mfrow=c(1,2))
acf(res.m3, lag.max = 36)
pacf(res.m3, lag.max = 36)

Transformation

Data transformation are important tools for proper statistical analysis.

The time series data was subjected to a log transformation before being graphed.

Time series plot with transformed data

log.cars.ts = log(cars.ts)
par(mfrow=c(1,1))
plot(log.cars.ts,ylab='log of sales count',xlab='Year',type='o', col="dark green")

We fit the model with log transformed data and draw conclusions from sample ACF & PACF’s Time series plot of the residuals after transformation

m4.cars = arima(log.cars.ts,order=c(0,0,0),seasonal=list(order=c(1,1,2), period=12))
res.m4 = residuals(m4.cars)
plot(res.m4,xlab='Time',ylab='Residuals',main="Time series plot of the residuals", col="red")

The ACF plot shows two significant lags before 1st seasonal lag. PACF shows 2 significant lags.

ACF plot of the residuals after transformation and PACF plot of the residuals after transformation

par(mfrow=c(1,2))
acf(res.m4, lag.max = 36)
pacf(res.m4, lag.max = 36)

Non-seasonal Differencing

The creation of a list of potential models is the primary goal here. Therefore, let’s begin with d=1.

We begin with simple differencing with d=1 to remove any lingering trend and correlation in ACF/PACF plots because we don’t observe any decaying pattern.

We gain something from differentiation. It aids in the development of a model.

ACF/PACF plot of the residuals with ordinary differencing

# SARIMA(0,1,0)x(1,1,2)
m5.cars = arima(log.cars.ts,order=c(0,1,0),seasonal=list(order=c(1,1,2), period=12))
res.m5 = residuals(m5.cars)
par(mfrow=c(1,2))
acf(res.m5, lag.max = 36)
pacf(res.m5, lag.max = 36)

At the first lag, we obtain a high correlation and also note a small number of significant lags. The intervention point is to blame for everything.

We come up with MA(3 or 4)and AR(3) from ACF & PACF plots.

library(tseries)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

adf.test(res.m5)

## Warning in adf.test(res.m5): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  res.m5
## Dickey-Fuller = -7.9982, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

EACF

To see information about the AR and MA parts that are still present in the residuals from the previous step, or res.m5, we will now use EACF. Our ARMA part candidates are ARMA(1,2), ARMA(2,2)& ARMA(2,1)

#eacf(res.m5)

AR/MA

0 1 2 3 4 5 6 7 8 9 10 11 12 13

0 x o o o o o o o o o o o o o

1 x x o o o o o o o o o o o o

2 x o o o o o o o o o o o o o

3 x o o x o o o o o o o x o o

4 x o o x o o o o o o o o o o

5 x x x x o o o o o o o o o o

6 x o x o x o o o o o o o o o

7 x o x o x o o o o o o o o o

The tentative models are specified as:

• SARIMA (0,1,4)x(1,1,2) by ACF if we consider pacf is tailing off so we take only MA coefficient.

• SARIMA (0,1,3)x(1,1,2) by ACF if we consider pacf is tailing off so we take only MA coefficient.

• SARIMA (3,1,4)x(1,1,2) by ACF/PACF

• SARIMA (2,1,1)x(1,1,2) by EACF

• SARIMA (2,1,2)x(1,1,2) by EACF

• SARIMA (3,1,2)x(1,1,2) for over fitting SARIMA (2,1,2)x(1,1,2)

Model Fitting

1.SARIMA (0,1,3)x(1,1,2)

model2.cars = arima(log.cars.ts,order=c(0,1,3),seasonal=list(order=c(1,1,2),period=12),method ="ML")
#coeftest(model2.cars)

ACF/PACF exhibits the presence of white noise, making it one of the better models. We then fit additional models and compare results.

ACF/PACF plot of the residuals for ARIMA(0,1,3)x(1,1,2)

res.model2 = residuals(model2.cars)
par(mfrow=c(1,2))
acf(res.model2, lag.max = 36)
pacf(res.model2, lag.max = 36)

SARIMA(0,1,4)x(1,1,2)

model1.cars = arima(log.cars.ts,order=c(0,1,4),seasonal=list(order=c(1,1,2), period=12),method = "ML")
#coeftest(model1.cars)

According to the coefficient test, MA(2) and MA(4) are not significant, while MA(3) is marginally significant.

Thus, we are unable to use this model for additional analysis.

Also in PACF, we see some slightly significant lag.

ACF/PACF plot of the residuals for ARIMA(0,1,4)x(1,1,2)

res.model1 = residuals(model1.cars);
par(mfrow=c(1,2))
acf(res.model1, lag.max = 36)
pacf(res.model1, lag.max = 36)

SARIMA(3,1,4)x(1,1,2)

We go with the bigger model from ACF/PACF

model3.cars = arima(log.cars.ts,order=c(3,1,4),seasonal=list(order=c(1,1,2),period=12),method = "ML")
#coeftest(model3.cars)

The coefficient test does not reveal any significant results, but we can see that the residuals’ white noise is present.

So, we later determine whether or not this model is appropriate.

ACF/PACF plot of the residuals for ARIMA(3,1,4)x(1,1,2)

res.model3 = residuals(model3.cars)
par(mfrow=c(1,2))
acf(res.model3, lag.max = 36)
pacf(res.model3, lag.max = 36)

SARIMA(2,1,1)x(1,1,2)

model4.cars = arima(log.cars.ts,order=c(2,1,1),seasonal=list(order=c(1,1,2), period=12),method = "ML")
#coeftest(model4.cars)

MA(1) coefficient shows significant values but AR component doesn’t. Father ACF/PACF plots shows slightly significant lags.

ACF/PACF plot of the residuals for ARIMA(2,1,1)x(1,1,2)

res.model4 = residuals(model4.cars)
par(mfrow=c(1,2))
acf(res.model4, lag.max = 36)
pacf(res.model4, lag.max = 36)

SARIMA(2,1,2)x(1,1,2)

model5.cars = arima(log.cars.ts,order=c(2,1,2),seasonal=list(order=c(1,1,2),
period=12),method = "ML")
#coeftest(model5.cars)

Only AR(2) is significant. ACF/PACF shows presence of white noise

ACF/PACF plot of the residuals for ARIMA(2,1,2)x(1,1,2)

res.model5 = residuals(model5.cars)
par(mfrow=c(1,2))
acf(res.model5, lag.max = 36)
pacf(res.model5, lag.max = 36)

AIC & BIC

We computed the coefficient test and fitted every possible model in the list above. We can use the AIC/BIC function to choose the best model.

sort.score <- function(x, score = c("bic", "aic")){
 if (score == "aic"){
 x[with(x, order(AIC)),]
 } else if (score == "bic") {
 x[with(x, order(BIC)),]
 } else {
 warning('score = "x" only accepts valid arguments ("aic","bic")')
 }
}

sc.AIC=AIC(model1.cars,model2.cars,model3.cars,model4.cars,model5.cars)
sc.BIC=AIC(model1.cars,model2.cars,model3.cars,model4.cars,model5.cars,k=log(length(cars.ts)))

AIC gives us model2 i.e. SARIMA(0,1,3)x(1,1,2) as best model.

sort.score(sc.AIC, score = "aic")

##             df      AIC
## model4.cars  7 53.84486
## model2.cars  7 53.84549
## model1.cars  8 55.44509
## model5.cars  8 55.64937
## model3.cars 11 59.11388

BIC also gives us the same model2.cars i.e. SARIMA(0,1,3)x(1,1,2) as best one.

sort.score(sc.BIC, score = "aic")

##             df      AIC
## model4.cars  7 72.61978
## model2.cars  7 72.62041
## model1.cars  8 76.90214
## model5.cars  8 77.10642
## model3.cars 11 88.61732

We always opt for a model whose residuals contain white noise, so we always choose one with residuals like that.

The specified model is further examined using residual analysis.

Model Diagnostics

We will use SARIMA(0,1,3)x(1,1,2) for parameter estimation, parameter significance testing, and forecasting.

Additionally, we must verify that the residuals are normal (on top of being presence of white noise)

residual.analysis <- function(model, std = TRUE){
library(TSA)
#library(FitAR)
 if (std == TRUE){
 res.model = rstandard(model)
 }else{
 res.model = residuals(model)
 }
 par(mfrow=c(2,2))
 plot(res.model,type='o',ylab='Standardized residuals', main="Time series plot of standardized residuals")
 abline(h=0)
 hist(res.model,main="Histogram of standardized residuals")
 qqnorm(res.model,main="QQ plot of standardized residuals")
 qqline(res.model, col = 2)
 acf(res.model,main="ACF of standardized residuals")
 print(shapiro.test(res.model))
}

residual.analysis(model=model2.cars)

## Warning: package 'TSA' was built under R version 4.2.2

## 
## Attaching package: 'TSA'

## The following objects are masked from 'package:stats':
## 
##     acf, arima

## The following object is masked from 'package:utils':
## 
##     tar

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.67961, p-value = 5.033e-14

ARIMA(0,1,3)x(1,1,2)

From the residual analysis of SARIMA (0,1,3)x(1,1,2) model we draw the following conclusions:

residual.analysis(model=model4.cars)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.681, p-value = 5.416e-14

ARIMA(2,1,1)x(1,1,2)

From the residual analysis of SARIMA (2,1,1)x(1,1,2) model we draw the following conclusions:

residual.analysis(model=model1.cars)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.68788, p-value = 7.812e-14

ARIMA(2,1,2)x(1,1,2) From the residual analysis of SARIMA (2,1,2)x(1,1,2) model we draw the following conclusions:

residual.analysis(model=model5.cars)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.68436, p-value = 6.474e-14

Forecasting

After selecting the top model, ARIMA (0,1,3)x(1,1,2), we will take it into consideration for time series prediction going forward. Forecasts for the model that we fitted are shown in the image below with a lead period of 10 years.

The name of the data series, the number of times we want a forecast (10 years in our example), and then the parameters of the ARIMA model are listed in the command’s sequence of parameters.

plot(log.cars.ts)

Conclusion

Using the model we determined to be the best out of a range of feasible models, we have ultimately predicted automobile sales for the upcoming ten years. The estimate indicates an upward tendency, which is plausible given that car sales may increase annually due to technological improvement, population expansion, and other factors.

Car sales in Quebec (thousands) Jan 1960 - Dec 1968

Tran Thi Chinh 20192635 - Dau Thi Thu Hang 20192639

2022-12-02