Forecast Tractor Sales using Time Series ARIMA Model

AIM

Forecast tractor sales for the next 36 months.

PROBLEM STATEMENT:

PowerHorse, a tractor and farm equipment manufacturing company, was established a few years after World War II. The company has shown a consistent growth in its revenue from tractor sales since its inception. However, over the years the company has struggled to keep it’s inventory and production cost down because of variability in sales and tractor demand. The management at PowerHorse is under enormous pressure from the shareholders and board to reduce the production cost.

You have to develop an ARIMA model to forecast sale / demand of tractors for next 3 years.

DATA DICTIONARY

The dataset consists of 144 observations having the total monthwise sales data of Tractors for a period of past 12 years.

Initial setup:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forecast)

Step 1: Read and Plot tractor sales data as time series

data=read.csv("Tractor-Sales.csv")
#144 obs and 2 variables

data = ts(data[,2],start = c(2003,1),frequency = 12)
#data is converted into time series.
#NOTE:we only take second column(number of tractor sold) for further analysis.not first column.
plot(data, xlab='Years', ylab = 'Tractor Sales')

Clearly the above chart has an upward trend for tractors sales and there is also a seasonal component.

Step 2: Difference data to make data stationary on mean (remove trend)

The next thing to do is to make the series stationary. This to remove the upward trend through 1st order differencing.

plot(diff(data),ylab='Differenced Tractor Sales')

#hence trend is removed.

Now, the above series is not stationary on variance i.e. variation in the plot is increasing as we move towards the right of the chart. We need to make the series stationary on variance to produce reliable forecasts through ARIMA models.

Step 3: log transform data to make data stationary on variance

One of the best ways to make a series stationary on variance is through transforming the original series through log transform. We will go back to our original tractor sales series and log transform it to make it stationary on variance.

Notice, the following series is not stationary on mean since we are using the original data without differencing.

plot(log10(data),ylab='Log (Tractor Sales)')

Now the above series looks stationary on variance.

Step 4: Difference log transform data to make data stationary on both mean and variance.

Let us look at the differenced plot for log transformed series to reconfirm if the series is actually stationary on both mean and variance.

plot(diff(log10(data)),ylab='Differenced Log (Tractor Sales)')

Yes, now the above series looks stationary on both mean and variance.

This also gives us the clue that I or integrated part of our ARIMA model will be equal to 1 as 1st difference is making the series stationary.

Step 5: Plot ACF and PACF to identify potential AR and MA model

Let us create autocorrelation factor (ACF) and partial autocorrelation factor (PACF) plots to identify patterns in the above data which is stationary on both mean and variance.

The idea is to identify presence of AR and MA components in the residuals.

par(mfrow = c(1,2))

acf(ts(diff(log10(data))),main='ACF Tractor Sales')

pacf(ts(diff(log10(data))),main='PACF Tractor Sales')

Observation by acf and pacf plots:

Since, there are enough spikes in the plots outside the insignificant zone (dotted horizontal lines) we can conclude that the residuals are not random.

This implies that there is juice or information available in residuals to be extracted by AR and MA models.

Also, there is a seasonal component available in the residuals at the lag 12 (represented by spikes at lag 12). This makes sense since we are analyzing monthly data that tends to have seasonality of 12 months because of patterns in tractor sales.

Step 6: Identification of best fit ARIMA model

Auto arima function in forecast package in R helps us identify the best fit ARIMA model on the fly.

library(tseries)

## Warning: package 'tseries' was built under R version 3.3.3

ARIMAfit = auto.arima(log10(data), approximation=FALSE,trace=FALSE)
summary(ARIMAfit)

## Series: log10(data) 
## ARIMA(0,1,1)(0,1,1)[12]                    
## 
## Coefficients:
##           ma1     sma1
##       -0.4047  -0.5529
## s.e.   0.0885   0.0734
## 
## sigma^2 estimated as 0.0002571:  log likelihood=354.4
## AIC=-702.79   AICc=-702.6   BIC=-694.17
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE      MAPE
## Training set 0.0002410698 0.01517695 0.01135312 0.008335713 0.4462212
##                   MASE       ACF1
## Training set 0.2158968 0.01062604

Observations: The best fit model is selected based on Akaike Information Criterion (AIC) , and Bayesian Information Criterion (BIC) values. The idea is to choose a model with minimum AIC and BIC values.

As expected, our model has I (or integrated) component equal to 1. This represents differencing of order 1.

There is additional differencing of lag 12 in the above best fit model. Moreover, the best fit model has MA value of order 1. Also, there is seasonal MA with lag 12 of order 1.

Step 7: Forecast sales using the best fit ARIMA model

The next step is to predict tractor sales for next 3 years i.e. for 2015, 2016, and 2017 through the above model.

par(mfrow = c(1,1))
pred = predict(ARIMAfit, n.ahead = 36) #36 months makes 3 years
pred

## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2015 2.754168 2.753182 2.826608 2.880192 2.932447 2.912372 2.972538
## 2016 2.796051 2.795065 2.868491 2.922075 2.974330 2.954255 3.014421
## 2017 2.837934 2.836948 2.910374 2.963958 3.016213 2.996138 3.056304
##           Aug      Sep      Oct      Nov      Dec
## 2015 2.970585 2.847264 2.797259 2.757395 2.825125
## 2016 3.012468 2.889147 2.839142 2.799278 2.867008
## 2017 3.054351 2.931030 2.881025 2.841161 2.908891
## 
## $se
##             Jan        Feb        Mar        Apr        May        Jun
## 2015 0.01603508 0.01866159 0.02096153 0.02303295 0.02493287 0.02669792
## 2016 0.03923008 0.04159145 0.04382576 0.04595157 0.04798329 0.04993241
## 2017 0.06386474 0.06637555 0.06879478 0.07113179 0.07339441 0.07558934
##             Jul        Aug        Sep        Oct        Nov        Dec
## 2015 0.02835330 0.02991723 0.03140337 0.03282229 0.03418236 0.03549035
## 2016 0.05180825 0.05361850 0.05536960 0.05706700 0.05871534 0.06031866
## 2017 0.07772231 0.07979828 0.08182160 0.08379608 0.08572510 0.08761165

plot(data,type='l',xlim=c(2004,2018),ylim=c(1,1600),xlab = 'Year',ylab = 'Tractor Sales')
lines(10^(pred$pred),col='blue')
lines(10^(pred$pred+2*pred$se),col='orange')
lines(10^(pred$pred-2*pred$se),col='orange')

The above is the output with forecasted values of tractor sales in blue. Also, the range of expected error (i.e. 2 times standard deviation) is displayed with orange lines on either side of predicted blue line.

Assumptions while forecasting: Forecasts for a long period of 3 years is an ambitious task. The major assumption here is that the underlining patterns in the time series will continue to stay the same as predicted in the model. A short-term forecasting model, say a couple of business quarters or a year, is usually a good idea to forecast with reasonable accuracy. A long-term model like the one above needs to evaluated on a regular interval of time (say 6 months). The idea is to incorporate the new information available with the passage of time in the model.

Step 8: Plot ACF and PACF for residuals of ARIMA model to ensure no more information is left for extraction.

Finally, let’s create an ACF and PACF plot of the residuals of our best fit ARIMA model i.e. ARIMA(0,1,1)(0,1,1)[12].

par(mfrow=c(1,2))
acf(ts(ARIMAfit$residuals),main='ACF Residual')

pacf(ts(ARIMAfit$residuals),main='PACF Residual')

Since there are no spikes outside the insignificant zone for both ACF and PACF plots we can conclude that residuals are random with no information or juice in them. Hence our ARIMA model is working good and predictions were successfully made.