1. Analysis objective

The objective of this analysis is to forecast revenue collection for the 12 months beyond the last available observation. It is based on monthly Pakistan tax collection data; the series runs from 2001 to the most recent available month. This analysis is still under development, and a detailed video on how to carry it out and create a report like this one will be uploaded to Data Analysis and Econometrics.

The main purpose of this analysis is to provide a tutorial for students who use time series modeling for forecasting. However, FBR officials can also use it for forecasting and for developing their own scenarios. Our data form a conventional univariate time series with a date variable and monthly tax variables, which we list in a following chunk. The steps applied in this analysis can be used in any forecasting exercise.

2. Models used: explanation

This analysis is statistical forecasting (not economic forecasting) based on three classic forecasting models (seasonal naïve, exponential smoothing, ARIMA) plus one neural network model.

I will start from the very basics and increase complexity gradually. Let me explain each model briefly.

As the following graphs will show, the data have strong seasonality. Therefore, the first and most basic model applied is the seasonal naïve model, which makes each forecast using the last value from the same season; for example, the same month of the previous year is used to forecast the value for the coming year.

The second model applied is the exponential smoothing state space method via the ETS model, where ETS refers to error, trend, and seasonality. This model can perform well for short-term forecasts on a univariate time series. It uses an exponentially weighted moving average (EWMA) to “smooth” the series and reduce random effects. A smoothing constant (a) defines the weight applied to the current period, and its complement (1-a) acts as a damping factor on the past. The ETS components can be additive or multiplicative; R selects the optimal form automatically.
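To make the smoothing idea concrete, here is a minimal base-R sketch of an EWMA with a hypothetical smoothing constant a = 0.3 applied to a few toy values (illustrative only, not the ets() fit used later):

a <- 0.3                               # hypothetical smoothing constant
y <- c(39, 32.8, 27.6, 34.9, 36.3)     # toy values (first months of the tax series)
s <- numeric(length(y))
s[1] <- y[1]                           # initialize with the first observation
for (t in 2:length(y)) {
  s[t] <- a * y[t] + (1 - a) * s[t - 1]  # weight a on the new value, damping (1 - a) on the past
}
s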

The third model applied is the autoregressive integrated moving average (ARIMA). ARIMA models are more complex than the two previous models, mainly because of the algorithm behind this forecasting model in R. ARIMA combines two models. First, the autoregressive model AR(p) forecasts the variable of interest using a linear combination of its past values, where p is the number of lags. Second, the moving average model MA(q) is a regression of the current value of the series on current and previous white noise error terms, or random shocks.
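As a quick illustration of the two building blocks (simulated data, not the tax series), one can generate an AR(1) and an MA(1) process with arima.sim():

set.seed(123)
ar1 <- arima.sim(model = list(ar = 0.7), n = 120)  # AR(1): current value depends on its own past
ma1 <- arima.sim(model = list(ma = 0.5), n = 120)  # MA(1): current value depends on past shocks
par(mfrow = c(1, 2))
plot.ts(ar1, main = "Simulated AR(1)")
plot.ts(ma1, main = "Simulated MA(1)")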

The fourth model applied is a neural network (NN), the most complex model used in this analysis. NN models perform well on nonlinear time series and large data sets. Because there appears to be some nonlinearity in our data, we decided to test the predictive capacity of this model. An NN is organised in multiple layers; the simplest networks contain no hidden layers and are equivalent to linear regressions. The coefficients attached to the predictors are called “weights”, the forecasts are obtained as a linear combination of the inputs, and the weights are selected by a “learning algorithm”.

2.1 Analysis outline

First, we load the data from the relevant working directory. For how to install R and RStudio and how to load data, watch R & RStudio and Data upload.

1. Install R packages, load the data, and declare the series as a time series.
2. Preliminary data observations.
3. Data decomposition: stationarity and lag specification; seasonal, cycle, and trend components.
4. Finding the most accurate model:
   4.1 Seasonal naïve method
   4.2 ETS method (exponential smoothing models)
   4.3 ARIMA model
   4.4 Neural network model
5. Make the forecast for the next 12 months.
6. Conclusion.
7. Create a report with Markdown.

2.2 Packages loaded for this analysis

Some of these packages are not required for this analysis.

library(dynlm)      # dynamic linear models
library(forecast)   # forecasting models: snaive, ets, auto.arima, nnetar
library(readxl)     # read Excel files
library(stargazer)  # regression tables
library(scales)     # axis scales for ggplot2
library(xts)        # extensible time series objects
library(urca)       # unit root tests
library(tidyverse)  # data manipulation and ggplot2
library(tsibble)    # tidy time series objects
library(fpp2)       # data/functions from Forecasting: Principles and Practice (2nd ed.)
library(fpp3)       # data/functions from Forecasting: Principles and Practice (3rd ed.)
library(ggthemes)   # extra ggplot2 themes (e.g., The Economist)
library(lubridate)  # date handling

2.3 Upload data

The data are monthly tax collections in billions of rupees (source: FBR). I have loaded the data from the respective directory.

tax_data <- read_excel("C:/Users/hp/OneDrive - Higher Education Commission/Macroeconometrics/Tax_billion_rs.xlsx") #Source FBR 
tax_data
## # A tibble: 223 x 7
##    Date                  tax direct indirect sales excise custom
##    <dttm>              <dbl>  <dbl>    <dbl> <dbl>  <dbl>  <dbl>
##  1 2001-12-01 00:00:00  39     20.6     18.3  13.6    3.3    1.4
##  2 2002-01-01 00:00:00  32.8   13.4     19.4  12.6    3.4    3.4
##  3 2002-02-01 00:00:00  27.6    7.5     20.1  13.2    3.7    3.2
##  4 2002-03-01 00:00:00  34.9   12.5     22.4  14.1    3.9    4.4
##  5 2002-04-01 00:00:00  36.3   12.3     24    15.1    4.7    4.2
##  6 2002-05-01 00:00:00  37.4   10.3     27.1  16.9    4.9    5.3
##  7 2002-06-01 00:00:00  60.3   23.8     36.5  20.9    6.1    9.5
##  8 2002-07-01 00:00:00  23.6    4.8     18.8  12.3    2.6    3.9
##  9 2002-08-01 00:00:00  29.6    7       22.6  14.5    3.5    4.6
## 10 2002-09-01 00:00:00  37.2   11.8     25.4  16.9    3.6    4.9
## # ... with 213 more rows

2.4 Time Plot

We have used “The Economist” theme; one may use another of one’s own choice, such as a Financial Times or FiveThirtyEight theme.

ggplot(tax_data) + aes(x = Date, y = tax) + geom_line() +
  ylab("Rs. in Billion") +
  ggtitle("Monthly tax collection year wise in billion of Rs. by FBR") +
  labs(caption = "Source: FBR") +
  theme_economist()  # Plot total taxes with The Economist theme

2.5 Multiple line plot

There are total taxes, direct and indirect taxes, and the subcomponents of indirect taxes: sales tax, customs, and excise duties. Now we plot direct and indirect taxes.

ggplot(tax_data, aes(x=Date)) + 
  geom_line(aes(y = direct), color = "darkred") + 
  geom_line(aes(y = indirect), color="steelblue", linetype="twodash") #Method 1

2.6 Another approach for a multiple line chart

df <- tax_data %>%
  select(Date, tax, direct, indirect) %>%
  gather(key = "variable", value = "value", -Date)
df
## # A tibble: 669 x 3
##    Date                variable value
##    <dttm>              <chr>    <dbl>
##  1 2001-12-01 00:00:00 tax       39  
##  2 2002-01-01 00:00:00 tax       32.8
##  3 2002-02-01 00:00:00 tax       27.6
##  4 2002-03-01 00:00:00 tax       34.9
##  5 2002-04-01 00:00:00 tax       36.3
##  6 2002-05-01 00:00:00 tax       37.4
##  7 2002-06-01 00:00:00 tax       60.3
##  8 2002-07-01 00:00:00 tax       23.6
##  9 2002-08-01 00:00:00 tax       29.6
## 10 2002-09-01 00:00:00 tax       37.2
## # ... with 659 more rows
# Visualization
ggplot(df, aes(x = Date, y = value)) +
  geom_line(aes(color = variable, linetype = variable)) +
  scale_color_manual(values = c("darkblue", "darkred", "steelblue")) +
  ylab("Rs. in Billion") +
  ggtitle("Monthly tax collection year wise in billion of Rs. by FBR") +
  labs(caption = "Source: FBR") +
  theme_economist()
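Note that gather() is superseded in current tidyr; the same reshaping can be done with pivot_longer(), shown here for reference:

df2 <- tax_data %>%
  select(Date, tax, direct, indirect) %>%
  pivot_longer(cols = -Date, names_to = "variable", values_to = "value")  # same result as gather() above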

We observe a strong positive trend and seasonality, and the trend appears somewhat nonlinear over time. Even from this first look, we can intuitively conclude that this time series is not stationary. In the next section, we will investigate and identify these potential problems, in order to correct the series and select the most appropriate forecasting model.

3 Preliminary data decomposition

3.1 Why do we investigate stationarity? (unit root test)

A time series is stationary if its characteristics, such as mean, variance, and covariance, are time invariant: that is, they do not change over time. Non-stationarity may cause autocorrelation, which we explain in a later step.

We will run the Augmented Dickey-Fuller test to check for stationarity in the data.

library(tseries)

adf.test(tax_data$direct, alternative = "stationary", k=12) 
## 
##  Augmented Dickey-Fuller Test
## 
## data:  tax_data$direct
## Dickey-Fuller = -2.2077, Lag order = 12, p-value = 0.4888
## alternative hypothesis: stationary

The null hypothesis of non-stationarity is not rejected. The test indicates that this series is not stationary.

To correct the non-stationarity problem, we take the first difference and run the Dickey-Fuller test again.

DS <- diff(tax_data$direct)
adf.test(DS, alternative = "stationary", k=12)
## Warning in adf.test(DS, alternative = "stationary", k = 12): p-value smaller
## than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  DS
## Dickey-Fuller = -6.1464, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary

With this small p-value, the Dickey-Fuller test rejects the null hypothesis, and we can conclude that the differenced series is stationary. Taking the first difference corrects the initial non-stationary time series.

We can visualize the data to see the effect of this transformation; here we plot the month-on-month percentage change, a closely related way of removing the trend.

DS <- tax_data %>% mutate(mom_pct = (direct / lag(direct) - 1) * 100)  # month-on-month % change
ggplot(DS) + aes(x = Date, y = mom_pct) + geom_line() +
  ggtitle("Month on month basis changes in tax collection")
## Warning: Removed 1 row(s) containing missing values (geom_path).

Differencing removes the trend from the data; with the trend removed, we can work with this time series without the trend influencing the forecast.
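Since the plot above shows the percentage change while the text refers to the first difference, here is a quick look at the simple first difference as well (a base-R sketch):

tax_diff <- diff(tax_data$direct)  # simple first difference of direct taxes
plot.ts(tax_diff, main = "First difference of direct tax collection")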

3.2 What problems are caused by non-stationarity (trend)?

Stationarity also matters because it helps identify driving factors. When we detect a change in one time series, we may want to infer a correlation with another, but we need both series to be stationary (no trend and no seasonality); otherwise, the correlation we find may be misleading.

There are three main problems with stochastic trends (see the simulation sketch after this list):

1. AR coefficients can be badly biased towards zero. This means that if you estimate an AR model and make forecasts when there is a unit root, your forecasts can be poor.

2. Some t-statistics do not have a standard normal distribution, even in large samples (more on this later).

3. If Y and X both have random walk trends, they can look related even if they are not: you can get “spurious regressions.”
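A minimal simulation sketch of problem 3, using two independent random walks (variable names are hypothetical):

set.seed(42)
n <- 200
rw_y <- cumsum(rnorm(n))   # random walk 1
rw_x <- cumsum(rnorm(n))   # random walk 2, generated independently of rw_y
summary(lm(rw_y ~ rw_x))   # often reports a highly "significant" slope despite no true relation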

3.4 Why do we investigate autocorrelation?

Autocorrelation means that there are correlations in the errors, i.e., lag correlations of a given series with itself, lagged by a number of time units. For a regression

\[Y_i = \alpha + \beta X_i + u_i\]

this means

\[Cov(u_i, u_s) \neq 0 \quad \forall\ i \neq s\]

Autocorrelation measures the linear relationship between lagged values of a time series. We will look for dependence in the data across a range of lags.

acf(tax_data$direct)

The ACF plot shows that the data are strongly non-random, so an autoregressive model may be appropriate. We can also check autocorrelation by plotting the residuals and standardized residuals of a regression against time; a similar pattern in both is a sign of autocorrelation.

3.6 Seasonality

Seasonality is a pattern that occurs when a time series is observed at less than annual frequency and is affected by seasonal factors such as the time of year or the day of the week. We know that tax collection authorities usually have to meet quarterly targets, and accounting lags also make collection a seasonal phenomenon. We need to detect seasonality in a time series in order to make the necessary adjustments or to choose appropriate models. Seasonal adjustment has three main purposes:

1. to aid short-term forecasting;
2. to help relate the time series to other series or to extreme events;
3. to allow series to be compared from year to year, month to month, or day to day.

Following two visualizations will help to identify the seasonality.

ts_tax <- ts(DS$mom_pct, frequency = 12, start = c(2001, 1))  # note: the raw data begin in Dec 2001, so start = c(2001, 12) would label the dates correctly
decomp <- decompose(ts_tax)  # split into trend, seasonal, and random components
plot(decomp)

library(ggthemes)

ggseasonplot(ts_tax,year.labels=TRUE, year.labels.left=TRUE) +
  ylab("Rs. in Billion") +
  ggtitle("Monthly tax collection year wise in billion of Rs. by FBR")+ labs(caption=" Source :FBR")+theme_economist()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_text).

We can observe that the 19 coloured lines follow much the same pattern across the years. These lines indicate the possible presence of seasonal cycles in this time series.

Now, look at another seasonal plot, which isolates the variation of each month, one month at a time.

ggsubseriesplot(ts_tax)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 19 row(s) containing missing values (geom_path).

The horizontal lines indicate the means for each month. This form of plot enables the underlying seasonal pattern to be seen clearly, and also shows the changes in seasonality over time. It is especially useful in identifying changes within particular seasons.

Both the line and seasonal plots indicate a strong trend and a seasonal pattern in the tax data.

4. Fit a Model

In this section we fit models to the tax data and find the most appropriate one.

4.1 Seasonal naïve method as our benchmark

This model is used as a benchmark: the performance of any more technical model is compared against it. The seasonal naïve model uses the most recent observation from the same season as the forecast. We use it here because the data appear seasonal.

\[\hat{y}_{T+h|T}=y_{T+h-m(k+1)}\]

where m is the seasonal period (12 for monthly data) and k is the integer part of (h-1)/m.

fit_SN <- snaive(ts_tax, h = 24)  # seasonal naive forecast, 24 months ahead
fit_SN
##          Point Forecast      Lo 80      Hi 80       Lo 95     Hi 95
## Aug 2019     -61.781323 -116.64006  -6.922584 -145.680524  22.11788
## Sep 2019      -8.219178  -63.07792  46.639561  -92.118380  75.68002
## Oct 2019      87.734842   32.87610 142.593581    3.835640 171.63404
## Nov 2019     -36.228657  -91.08740  18.630082 -120.127859  47.67054
## Dec 2019      -2.371917  -57.23066  52.486823  -86.271118  81.52728
## Jan 2020     120.991254   66.13251 175.849993   37.092052 204.89046
## Feb 2020     -50.483729 -105.34247   4.375010 -134.382931  33.41547
## Mar 2020     -10.834813  -65.69355  44.023926  -94.734015  73.06439
## Apr 2020      36.553785  -18.30495  91.412524  -47.345417 120.45299
## May 2020     -39.815463  -94.67420  15.043276 -123.714665  44.08374
## Jun 2020      -3.664877  -58.52362  51.193862  -87.564079  80.23432
## Jul 2020     160.519066  105.66033 215.377805   76.619864 244.41827
## Aug 2020     -61.781323 -139.36330  15.800650 -180.432712  56.87007
## Sep 2020      -8.219178  -85.80115  69.362795 -126.870567 110.43221
## Oct 2020      87.734842   10.15287 165.316815  -30.916547 206.38623
## Nov 2020     -36.228657 -113.81063  41.353316 -154.880046  82.42273
## Dec 2020      -2.371917  -79.95389  75.210056 -121.023305 116.27947
## Jan 2021     120.991254   43.40928 198.573227    2.339865 239.64264
## Feb 2021     -50.483729 -128.06570  27.098244 -169.135118  68.16766
## Mar 2021     -10.834813  -88.41679  66.747159 -129.486202 107.81658
## Apr 2021      36.553785  -41.02819 114.135758  -82.097604 155.20517
## May 2021     -39.815463 -117.39744  37.766510 -158.466852  78.83593
## Jun 2021      -3.664877  -81.24685  73.917096 -122.316266 114.98651
## Jul 2021     160.519066   82.93709 238.101038   41.867677 279.17045
checkresiduals(fit_SN)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 78.066, df = 24, p-value = 1.233e-07
## 
## Model df: 0.   Total lags used: 24

The summary(fit_SN) command reports a residual sd of 42.8065. The ACF plot of the residuals indicates autocorrelation among the residuals, and the Ljung-Box statistic is very high, rejecting the null hypothesis that the residuals are uncorrelated.

The naïve and seasonal naïve models are fundamental benchmarks. Some businesses rely on these basic forecasting models, perhaps for lack of internal resources, even though forecast errors are costly: producing or holding extra stock is a cost for the company and creates inefficiency. We continue by testing the forecast performance of other models.

4.2 Fit the ETS method (exponential smoothing)

Second, we apply the ETS model: Error, Trend, Seasonal. The flexibility of the ETS model lies in its ability to combine trend and seasonal components of different types. The ets() function automatically optimizes the choice of model and the necessary parameters. We present the structure of the additive and multiplicative forms.

Assuming: \[\mu_t = \hat{y}_t=l_{t-1}+ b_{t-1}\] and \[\varepsilon_t = y_{t} - \mu_{t}\]

ETS additive

\[y_t = l_{t-1} + \phi b_{t-1} + \varepsilon_t\]

\[l_t = l_{t-1} + \phi b_{t-1} + \alpha\varepsilon_t\]

\[b_t = \phi b_{t-1} + \beta^*(l_{t}-l_{t-1}- \phi b_{t-1}) = \phi b_{t-1}+\alpha\beta^*\varepsilon_t\]

ETS multiplicative

\[\varepsilon_t = (y_t-\mu_t ) / \mu_t\]

\[y_t = (l_{t-1} + \phi b_{t-1}) (1+\varepsilon_t)\]

\[l_t = (l_{t-1} + \phi b_{t-1}) (1+\alpha\varepsilon_t)\]

\[b_t = \phi b_{t-1}+\beta(l_{t-1} + \phi b_{t-1})\varepsilon_t\]

fit_ets <- ets(ts_tax) #residual = 221.1631
## Warning in ets(ts_tax): Missing values encountered. Using longest contiguous
## portion of time series
checkresiduals(fit_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 96.146, df = 10, p-value = 3.331e-16
## 
## Model df: 14.   Total lags used: 24

The summary(fit_ets) command reports a residual sd of 40.49, more accurate than the seasonal naïve model: on average, the fitted value for a given month misses the actual collection by about 40.49 billion.

So this model increases precision and offers a better fit. However, looking at the ACF plot, we observe that autocorrelation remains, because some bars cross the 95% confidence lines.

The Ljung-Box test results also indicate rejection of the null hypothesis that the residuals are uncorrelated.

We see that using a slightly more complex forecasting model increases accuracy and can significantly improve tax collection planning. We continue our analysis with one of the best-performing forecasting models, ARIMA.

4.3 Fit an ARIMA model

The ARIMA model can be seen as a generalized random walk model, fine-tuned to eliminate all residual autocorrelation, and as a generalized exponential smoothing model that can incorporate long-term trends and seasonality.

AR(p) model: \[(1-\sum^p_{k=1}\alpha_kL^k)X_t = \varepsilon_t\]

MA(q) model: \[X_t = (1+\sum^q_{k=1}\beta_kL^k)\varepsilon_t\]

The first difference is \(\Delta X_t=X_t -X_{t-1} = (1-L)X_t\), where \(Y_t = (1-L)X_t\).

ARIMA(p, d, q) full model \[(1-\sum^p_{k=1}\alpha_kL^k)(1-L)^dX_t = (1+\sum^q_{k=1}\beta_kL^k)\varepsilon_t\]

ARIMAX model

\[\Delta y_t=\alpha_0+\sum_{j}\alpha_j \Delta y_{t-j}+\sum_h\gamma_h\epsilon_{t-h}+X\beta+\epsilon_t\]

fit_ARIMA <- auto.arima(ts_tax, d = 1, D = 1, stepwise = FALSE, approximation = FALSE)  # force one ordinary and one seasonal difference; full (non-stepwise, exact) search
checkresiduals(fit_ARIMA)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)(0,1,1)[12]
## Q* = 24.722, df = 19, p-value = 0.1699
## 
## Model df: 5.   Total lags used: 24

The residual standard error is sqrt(1251) ≈ 35.4, meaning that, for a given month, the fitted value misses the actual collection by about 35 billion on average.

At this stage, we can conclude that the ARIMA model offers the best fit based on the residual standard deviation.

The Ljung-Box test results also indicate that the model fits well: with p = 0.17, the null hypothesis of no residual autocorrelation is not rejected. So far the ARIMA model not only passes the diagnostic testing but also performs better than the seasonal naïve and ETS models.

For economic theory or behavioral models, simpler is often better; for forecasting, however, the capacity to apply and understand complex models adds value to any business or forecasting team. Many decisions are based on the capacity to forecast revenue. Accuracy in short-term revenue forecasting has a critical impact on all economic planning, and accuracy in long-term forecasting supports major fiscal decisions. In the end, short- and long-term forecast accuracy has a significant impact on optimizing the growth of the economy, while poor forecasting can reduce the FBR's capacity for efficient decision making.

4.4 Neural network models

This model allows complex nonlinear relationships between the response variable and its predictors. Where a time series is mostly linear, a neural network may not be the most appropriate model, but we will compare its performance against the seasonal naïve, ETS, and ARIMA models.

In a neural network, the inputs to each node are combined using a weighted linear combination. For example, the inputs (here, 4 of them) into hidden neuron j are combined linearly to give \[z_j = b_j+\sum_{i=1}^4 w_{i,j}x_i\] and in the hidden layer this is then modified using a nonlinear function such as the sigmoid, \[s(z) = \frac{1}{1+e^{-z}}\]
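A tiny base-R sketch of this computation for one hidden neuron, with made-up inputs and weights (purely illustrative; nnetar() estimates the weights itself):

sigmoid <- function(z) 1 / (1 + exp(-z))  # squashes any real number into (0, 1)
x <- c(1.2, -0.5, 0.3, 2.0)   # hypothetical inputs
w <- c(0.4, -0.1, 0.6, 0.2)   # hypothetical weights w_{i,j}
b <- 0.1                      # hypothetical bias b_j
z <- b + sum(w * x)           # weighted linear combination z_j
sigmoid(z)                    # activation of hidden neuron j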

NNL <- nnetar(ts_tax)  # neural network autoregression, no transformation
## Warning in nnetar(ts_tax): Missing values in x, omitting rows
NN <- nnetar(ts_tax, lambda = "auto")  # with automatic Box-Cox transformation
## Warning in nnetar(ts_tax, lambda = "auto"): Missing values in x, omitting rows
checkresiduals(NNL)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

accuracy(NN)
##                    ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set 3.558917 31.62584 17.70454 -Inf  Inf 0.7098338 -0.1604708

Some spikes in the residual ACF plot go beyond the usual bounds, which may indicate a mild residual autocorrelation problem. This is a very complex model, but one that is often very useful in deep learning applications.

4.5 Comparative forecast performance

We will compare the forecast performance of these models using two strategies.

First strategy: compare the indicators produced by the accuracy() function when the whole time series is used as the training set.

Second strategy: as in the first, compare the measures produced by the accuracy() function, but split the dataset into a training set and a test set.
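A minimal sketch of the second strategy, assuming we hold out the last 24 months (the window dates below depend on how the series is labelled and are illustrative):

train <- window(ts_tax, end = c(2017, 7))    # training set: all but the last 24 months (assumed dates)
test  <- window(ts_tax, start = c(2017, 8))  # test set: the last 24 months
fit_train <- auto.arima(train)               # refit on the training set only
accuracy(forecast(fit_train, h = length(test)), test)  # reports training and test accuracy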

It is important to use the right tool for an accurate analysis and for understanding the impact on the results when a different strategy is used.

We will base our performance comparison on the following measures, which we define before using them.

Forecast error:

\[e_{t}={y}_{t}-\hat{y}_{t}\]

Mean absolute error (MAE):

\[MAE=mean(|e_{t}|)\]

Root mean squared error (RMSE):

\[RMSE=\sqrt{mean(e^2_{t})}\]

Mean absolute percentage error (MAPE):

\[p_{t}=100e_{t}/y_{t}, \quad MAPE=mean(|p_{t}|)\]

Mean absolute scaled error (MASE):

\[q_{t}= e_{t}/Q, \quad MASE=mean(|q_{t}|)\]

where Q is a scaling statistic computed on the training data.

Autocorrelation of errors at lag h (ACF):

\[ACF=\frac{Covariance(x_{t},x_{t-h})}{Variance(x_{t})}\]

If we print the summary of each model and compare accuracy, we observe that the ARIMA model seems best, with the NN model second best. Therefore, we now forecast 24 periods ahead with the NN and ARIMA models.

4.6 Forecast using the ANN model

NN <- nnetar(ts_tax, lambda="auto")
## Warning in nnetar(ts_tax, lambda = "auto"): Missing values in x, omitting rows
fcast2 <- forecast(NN, h = 24, PI = TRUE, npaths = 100)  # 24-month forecast; prediction intervals from 100 simulated paths
autoplot(fcast2, include = 60)

print(summary(fcast2))
## 
## Forecast method: NNAR(6,1,4)[12]
## 
## Model Information:
## 
## Average of 20 networks, each of which is
## a 7-4-1 network with 37 weights
## options were - linear output units 
## 
## Error measures:
##                    ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set 3.220608 30.30213 17.67813 -Inf  Inf 0.7087751 -0.1749367
## 
## Forecasts:
##          Point Forecast      Lo 80         Hi 80        Lo 95       Hi 95
## Aug 2019  -5.169436e+01 -158.49178 -1.094124e+01 -234.1142306  -2.7466262
## Sep 2019  -4.774822e+00  -41.09585 -8.356611e-05 -105.2675659   0.6133209
## Oct 2019   8.165244e+01   25.50489  1.827170e+02    6.8311103 304.1683222
## Nov 2019  -4.073227e+01 -111.97552 -2.938720e+00 -180.5397836   2.4055531
## Dec 2019  -4.807876e-01  -14.61266  5.179066e-01  -32.1279362  14.1405958
## Jan 2020   1.247356e+02   23.72348  2.664828e+02   10.8691422 379.6774819
## Feb 2020  -5.913015e+01 -145.65332 -7.224195e+00 -211.3976479  -0.1051606
## Mar 2020  -2.206156e+00  -20.08210  2.425274e-01  -48.3692600  16.3027448
## Apr 2020   6.786257e+01   17.54017  2.331229e+02    0.8826063 280.1054197
## May 2020  -4.151577e+01 -139.00005 -2.230847e+00 -302.6165900   2.7070464
## Jun 2020  -1.837218e-01  -10.32816  2.740475e+00  -51.5338809  45.1136965
## Jul 2020   1.273360e+02   41.27289  2.971563e+02   30.9630145 423.1870792
## Aug 2020  -5.994800e+01 -134.17387 -1.009125e+01 -238.9864997  -2.4874773
## Sep 2020  -1.252186e+00  -26.15167  1.159886e+01  -36.4481053  45.0175531
## Oct 2020   9.386853e+01   26.54873  2.415131e+02   15.8635100 354.1933832
## Nov 2020  -4.861098e+01 -131.27518 -1.249925e-01 -184.8265288   7.6857801
## Dec 2020  -5.544047e-05  -13.84367  1.457762e+01  -48.5025031  49.3396483
## Jan 2021   1.253656e+02   28.59167  2.370863e+02   11.7315148 299.5294661
## Feb 2021  -6.077255e+01 -131.07309 -6.987896e+00 -201.3598659   1.5095944
## Mar 2021  -6.062313e-02  -14.72633  2.438620e+01  -30.7252979  56.0358081
## Apr 2021   9.888162e+01   19.50173  2.578170e+02    5.5834930 383.8565299
## May 2021  -5.298765e+01 -146.86324 -3.727796e+00 -187.4641894  17.8886012
## Jun 2021   2.754169e-01  -14.07916  2.416272e+01  -41.4190751  80.2813308
## Jul 2021   1.300154e+02   32.12920  2.681639e+02    4.5053908 476.0571389

4.7 Forecast using the ARIMA model

fit_ARIMA %>% forecast(h=24) %>% autoplot()
## Warning: Removed 1 row(s) containing missing values (geom_path).

auto.arima(tax_data$direct) %>% forecast(h = 24) %>% autoplot()  # note: the raw column carries no monthly frequency, so this fit ignores seasonality