In this report, we analyze a dataset containing Canada's per capita income from 1970 to 2016. We will explore, transform, and visualize the data, and then try to find the best ARIMA model. After finding the best model, we will use it to predict income for the next 5 years.
Here, we import all the libraries that we will need later.
library(readr)
library(dplyr)
library(dLagM)
library(tseries)
library(FitAR)
library(TSA)
library(fUnitRoots)
library(ggplot2)
library(lmtest)
library(forecast)
In this section we will perform two major steps, i.e. importing our dataset and converting it into a time series.
Here, we import our dataset, name it income, and display a glimpse of it by showing the first 6 rows. The data has two variables: Year, which contains the year, and per capita income, which contains the income for each year. The dataset covers 1970-2016.
income <- read.csv("canada_per_capita_income.csv")
head(income)
In this step, I convert the data set into a time series object for further evaluation. This is an important step, as all subsequent analyses are performed on the time series.
TSincome <- ts(as.vector(income$per.capita.income..US..),start=1970,end=2016)
TSincome
Time Series:
Start = 1970
End = 2016
Frequency = 1
[1] 3399.299 3768.298 4251.175 4804.463 5576.515 5998.144 7062.131 7100.126 7247.967 7602.913
[11] 8355.968 9434.391 9619.438 10416.537 10790.329 11018.956 11482.892 12974.807 15080.283 16426.725
[21] 16838.673 17266.098 16412.083 15875.587 15755.820 16369.317 16699.827 17310.758 16622.672 17581.024
[31] 18987.382 18601.397 19232.176 22739.426 25719.147 29198.056 32738.263 36144.481 37446.486 32755.177
[41] 38420.523 42334.711 42665.256 42676.468 41039.894 35175.189 34229.194
Now, we will visualize the time series with a plot.
plot(TSincome, type='o', ylab="Income", xlab='Year', col="blue",
main = "Time Series plot of Canada's Per capita Income")
After analyzing the above graph, we can say that:
Trend - There is a clear upward trend in the series, which means that income increases over time.
Changing variance - A change in variance is not obvious in the plot, though there is a hint of it towards the end of the series.
Seasonality - No repeating patterns are visible in the plot, so there is no obvious seasonality.
Behavior - The series shows moving-average behavior, and successive points also suggest autoregressive behavior.
Change point - There is no sign of a change point in the plot.
Now, we will fit a linear model to our series and name it Lmmodel. The summary shows that the model is significant at the 95% confidence level and has a good R² value (0.89). I have also visualized the model by plotting the fitted line over the series.
Lmmodel = lm(TSincome~time(TSincome))
summary(Lmmodel)
Call:
lm(formula = TSincome ~ time(TSincome))
Residuals:
Min 1Q Median 3Q Max
-7144.1 -2679.3 281.3 2425.4 8502.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.632e+06 8.613e+04 -18.95 <2e-16 ***
time(TSincome) 8.285e+02 4.321e+01 19.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4019 on 45 degrees of freedom
Multiple R-squared: 0.8909, Adjusted R-squared: 0.8885
F-statistic: 367.5 on 1 and 45 DF, p-value: < 2.2e-16
plot(TSincome,type='o',ylab='Income',xlab='Year',col=c("blue"),
main="Fitted Linear curve of Canada's Per capita Income",lwd=2)
abline(Lmmodel,col="red",lwd=2)
legend("topleft",lty=1,text.width = 19, col = c("blue","red"),
c("Canada's Per capita Income","Fitted linear curve"),bty = "n")
Now, we will perform a diagnostic check on our linear model, Lmmodel.
# Helper function: draws a QQ plot of the residuals with a fitted
# straight line for reference
qqanalysis <- function(myresiduals, title = 'QQ Plot of Residuals') {
  df = as.data.frame(qqnorm(myresiduals, plot = FALSE))  # theoretical/sample quantile pairs
  ggplot(df, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "red", size = .3) +
    xlab('Theoretical quantiles') +
    ylab('Standardized residuals') +
    ggtitle(title)
}
checkresiduals(Lmmodel)
Breusch-Godfrey test for serial correlation of order up to 9
data: Residuals
LM test = 40.86, df = 9, p-value = 5.303e-06
qqanalysis(residuals(Lmmodel))
shapiro.test(as.vector(residuals(Lmmodel)))
Shapiro-Wilk normality test
data: as.vector(residuals(Lmmodel))
W = 0.9772, p-value = 0.4821
From the above diagnostics, we can say that:
In the time series plot of the residuals, we can see a declining trend followed by an upward trend. No repeating patterns are present, so there is no sign of seasonality, and there is no clear evidence of change points.
The histogram of the residuals appears symmetric.
The ACF plot shows several significant lags and a declining pattern, indicating serial correlation in the residuals; the Breusch-Godfrey test confirms this (p-value = 5.3e-06, well below 0.05).
The QQ plot shows some points deviating from the line at the tails, but the Shapiro-Wilk test does not reject normality, as its p-value (0.48) is greater than 0.05.
Given this residual analysis, especially the strong serial correlation, the linear trend is not a good fit. Therefore, we need to fit a different model to our time series.
Now, we will check for stationarity, transform our data, and then check the transformed data for stationarity again.
Here, we check for stationarity in the original time series by plotting the ACF, the PACF, and a scatter plot of the series against its first lag, and by running the ADF test.
acf(TSincome)
pacf(TSincome)
adf.test(TSincome)
Augmented Dickey-Fuller Test
data: TSincome
Dickey-Fuller = -2.4234, Lag order = 3, p-value = 0.4053
alternative hypothesis: stationary
plot(y=TSincome,x=zlag(TSincome),
ylab="Income",xlab="Previous year Income",col='blue')
The above plots show the following results:
The ACF plot shows that the first 9 lags are significant, and a slowly declining pattern is visible.
The PACF plot shows that lag 1 is significant, which means very high autocorrelation in the series.
The scatter plot also shows positive correlation, as there is a clear upward trend visible in the plot.
Both the ACF and PACF plots indicate that trend and non-stationarity are present in the series.
The ADF test supports this, as its p-value (0.41) is greater than 0.05.
# Computing the correlation between the series and its first lag
a = TSincome
b = zlag(TSincome)      # series lagged by one year
index = 2:length(b)     # drop the leading NA created by the lag
cor(b[index], a[index])
[1] 0.9870614
The correlation of about 0.99 confirms that there is very high autocorrelation in the series.
Now, I will transform the dataset using the Box-Cox transformation.
TransformedTS=BoxCox.ar(TSincome,method = "yule-walker")
TransformedTS$ci
[1] 0.3 0.9
The output above gives the 95% confidence interval for the Box-Cox lambda. Taking the midpoint of this interval, I choose lambda = 0.6 ((0.3 + 0.9)/2), as computed below.
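As a small check of this arithmetic, the midpoint can be computed directly from the CI returned by BoxCox.ar (a sketch; lambda_mid is a name introduced here for illustration):
# Midpoint of the 95% CI for lambda; equals 0.6 here
lambda_mid = mean(TransformedTS$ci)
lambda_mid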
Now, I will fit the transformed series and see the results.
lambda = 0.6
BCIncome=(TSincome^lambda-1)/lambda
plot(BCIncome, main="Time series plot after Box-Cox transformation")
adf.test(BCIncome)
Augmented Dickey-Fuller Test
data: BCIncome
Dickey-Fuller = -2.5992, Lag order = 3, p-value = 0.3353
alternative hypothesis: stationary
After the Box-Cox transformation, the plot looks much the same, and the ADF test gives a p-value of 0.33, which is greater than 0.05. So we fail to reject the null hypothesis, and the series is still non-stationary.
Since the Box-Cox transformation did not help, I will now use differencing to transform the series.
First, I will apply first differencing and observe the output.
DiffBCIncome = diff(BCIncome)
plot(DiffBCIncome,type='o',ylab='Income',col="4",main="TS Plot after First Differencing")
abline(h=0, col ="6", lty=1)
adf.test(DiffBCIncome, alternative = c("stationary"))
Augmented Dickey-Fuller Test
data: DiffBCIncome
Dickey-Fuller = -2.5308, Lag order = 3, p-value = 0.3629
alternative hypothesis: stationary
Above, I have applied first differencing to the series and plotted the result. In the graph, we can observe that the series is much more stable than the original: the trend is gone, although some change points remain and a shift is visible at the end.
The ADF test gives a p-value of 0.36, which is greater than 0.05, so we cannot reject the null hypothesis of non-stationarity.
The ACF and PACF also look much better behaved (see the sketch below).
Therefore, the series is still non-stationary and we need to apply second differencing.
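The ACF and PACF of the first-differenced series are referred to above but the corresponding code is not shown; a minimal sketch to reproduce them:
# ACF and PACF of the first-differenced series
acf(DiffBCIncome, main = "ACF Plot after first differencing")
pacf(DiffBCIncome, main = "PACF Plot after first differencing")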
Now, I will perform second differencing on the series and observe the results.
# 2nd Differencing
Diff2BCIncome = diff(BCIncome, differences=2)
plot(Diff2BCIncome,type='o',ylab='Income',col="orange",main="TS Plot after Second Differencing")
abline(h=0, col ="3", lty=1)
adf.test(Diff2BCIncome)
Augmented Dickey-Fuller Test
data: Diff2BCIncome
Dickey-Fuller = -4.3795, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary
Observing the above plot of second differencing, we can see that the series is more stable and there is no trend.
The ADF test gives a p-value of 0.01, which is smaller than 0.05. Thus, we can finally reject the null hypothesis: the series is now stationary.
Now, we will try to find the best ARIMA model by identifying candidate models. For that, I have plotted the ACF plot, the PACF plot, the EACF table, and the BIC table.
acf(Diff2BCIncome,main="ACF Plot After second differencing")
pacf(Diff2BCIncome,main="PACF Plot After second differencing")
eacf(Diff2BCIncome,ar.max = 6,ma.max = 6)
AR/MA
0 1 2 3 4 5 6
0 x o o o o o o
1 x x o o o o o
2 o o o o o o o
3 x o x o o o o
4 o o o o o o o
5 x x x o o o o
6 x o o o o o o
res=armasubsets(y=Diff2BCIncome,nar=6,nma=6,y.name='test',ar.method = 'yw')
Reordering variables and trying again:
plot(res)
From the ACF and PACF plots, we identified ARIMA(2,2,1).
From the EACF table, we identified ARIMA(0,2,1), ARIMA(0,2,2), and ARIMA(1,2,2).
From the BIC table, we identified ARIMA(6,2,5) and ARIMA(6,2,4).
Therefore, our possible best models are:
ARIMA(2,2,1)
ARIMA(0,2,1)
ARIMA(0,2,2)
ARIMA(1,2,2)
ARIMA(6,2,5)
ARIMA(6,2,4).
Now, we will try to find our best model and perform overfitting checks and residual analysis on it.
Here, I have analyzed each candidate model by applying a coefficient test to it, using both CSS and ML estimation.
# ARIMA(2,2,1)
model_221_css = arima(BCIncome, order=c(2,2,1),method='CSS')
coeftest(model_221_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.348055 0.151631 2.2954 0.02171 *
ar2 -0.128571 0.176614 -0.7280 0.46663
ma1 -1.010226 0.031413 -32.1598 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_221_ML = arima(BCIncome, order=c(2,2,1),method='ML')
coeftest(model_221_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.338673 0.148514 2.2804 0.02258 *
ar2 -0.063588 0.169258 -0.3757 0.70715
ma1 -1.000000 0.071919 -13.9044 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above tests, we can say that for ARIMA(2,2,1):
CSS: the coefficients of AR(1) and MA(1) are statistically significant, while AR(2) is not.
ML: the coefficients of AR(1) and MA(1) are statistically significant, while AR(2) is not.
So this model is unlikely to be a good fit, as not all of its coefficients are statistically significant.
# ARIMA(0,2,1)
model_021_css = arima(BCIncome, order=c(0,2,1),method='CSS')
coeftest(model_021_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ma1 -0.70230 0.15658 -4.4853 7.281e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_021_ML = arima(BCIncome, order=c(0,2,1),method='ML')
coeftest(model_021_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ma1 -1.00000 0.10828 -9.2356 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above tests, we can say that for ARIMA(0,2,1):
CSS: the coefficient of MA(1) is statistically significant.
ML: the coefficient of MA(1) is statistically significant.
So this model can be a good fit, as all of its coefficients are statistically significant.
# ARIMA(0,2,2)
model_022_css = arima(BCIncome, order=c(0,2,2),method='CSS')
coeftest(model_022_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ma1 -0.69597 0.14885 -4.6756 2.931e-06 ***
ma2 -0.36874 0.15624 -2.3601 0.01827 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_022_ML = arima(BCIncome, order=c(0,2,2),method='ML')
coeftest(model_022_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ma1 -0.63951 0.16160 -3.9574 7.576e-05 ***
ma2 -0.36049 0.14698 -2.4526 0.01418 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above tests, we can say that for ARIMA(0,2,2):
CSS: the coefficients of MA(1) and MA(2) are statistically significant.
ML: the coefficients of MA(1) and MA(2) are statistically significant.
So this model can be a good fit, as all of its coefficients are statistically significant.
# ARIMA(1,2,2)
model_122_css = arima(BCIncome, order=c(1,2,2),method='CSS')
coeftest(model_122_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.34574 0.56107 -0.6162 0.5377
ma1 -0.40772 0.35905 -1.1355 0.2561
ma2 -0.67414 0.44728 -1.5072 0.1318
model_122_ML = arima(BCIncome, order=c(1,2,2),method='ML')
coeftest(model_122_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.073116 0.421034 -0.1737 0.8621
ma1 -0.574754 0.390344 -1.4724 0.1409
ma2 -0.425237 0.384689 -1.1054 0.2690
From the above tests, we can say that for ARIMA(1,2,2):
CSS: none of the coefficients AR(1), MA(1), and MA(2) are statistically significant.
ML: none of the coefficients AR(1), MA(1), and MA(2) are statistically significant.
So this model is unlikely to be a good fit, as none of its coefficients are statistically significant.
# ARIMA(6,2,5)
model_625_css = arima(BCIncome, order=c(6,2,5),method='CSS')
coeftest(model_625_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.39769593 0.00087857 -452.6636 < 2.2e-16 ***
ar2 -0.64227658 NA NA NA
ar3 -0.45189566 0.00186568 -242.2145 < 2.2e-16 ***
ar4 -0.51462187 NA NA NA
ar5 -0.37773713 0.05939065 -6.3602 2.015e-10 ***
ar6 -0.15934388 NA NA NA
ma1 -0.72732873 0.07462120 -9.7469 < 2.2e-16 ***
ma2 0.38815069 NA NA NA
ma3 -0.17692976 NA NA NA
ma4 -0.83237879 0.07494261 -11.1069 < 2.2e-16 ***
ma5 -0.87266172 NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_625_ML = arima(BCIncome, order=c(6,2,5),method='ML')
coeftest(model_625_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.92433 0.17843 -5.1802 2.217e-07 ***
ar2 -0.58395 0.22959 -2.5435 0.010975 *
ar3 -0.69572 0.23071 -3.0156 0.002565 **
ar4 -0.62861 0.26513 -2.3710 0.017742 *
ar5 -0.29014 0.26458 -1.0966 0.272815
ar6 -0.23366 0.21519 -1.0859 0.277545
ma1 0.30044 0.15393 1.9518 0.050965 .
ma2 -0.20992 0.15508 -1.3536 0.175860
ma3 0.20993 0.15602 1.3455 0.178468
ma4 -0.30043 0.15952 -1.8833 0.059656 .
ma5 -0.99999 0.15465 -6.4660 1.006e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above tests, we can say that for ARIMA(6,2,5):
CSS: the coefficients of AR(1), AR(3), AR(5), MA(1), and MA(4) are statistically significant, while the standard errors of the remaining coefficients are not available (NA), so we cannot assess them.
ML: the coefficients of AR(1), AR(2), AR(3), AR(4), and MA(5) are statistically significant, while the remaining coefficients are not.
So this model is unlikely to be a good fit, as not all of its coefficients are statistically significant.
# ARIMA(6,2,4)
model_624_css = arima(BCIncome, order=c(6,2,4),method='CSS')
coeftest(model_624_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.33375749 0.00069709 -478.7850 < 2.2e-16 ***
ar2 -0.63046110 0.00630004 -100.0725 < 2.2e-16 ***
ar3 -0.62228098 0.00159243 -390.7741 < 2.2e-16 ***
ar4 -0.26536747 0.04499961 -5.8971 3.699e-09 ***
ar5 -0.43823891 0.05619933 -7.7979 6.293e-15 ***
ar6 -0.14021400 0.03111667 -4.5061 6.604e-06 ***
ma1 -0.81277977 0.10599244 -7.6683 1.743e-14 ***
ma2 0.62672622 0.18212417 3.4412 0.0005791 ***
ma3 -0.37081530 0.16453317 -2.2537 0.0242124 *
ma4 -1.42817852 0.16673343 -8.5656 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_624_ML = arima(BCIncome, order=c(6,2,4),method='ML')
coeftest(model_624_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.934474 0.456490 2.0471 0.0406495 *
ar2 -1.079664 0.410932 -2.6274 0.0086052 **
ar3 0.481364 0.541100 0.8896 0.3736793
ar4 -0.277033 0.326815 -0.8477 0.3966176
ar5 0.010532 0.294677 0.0357 0.9714900
ar6 0.229018 0.177717 1.2887 0.1975139
ma1 -1.660654 0.453166 -3.6646 0.0002478 ***
ma2 1.732041 0.707214 2.4491 0.0143211 *
ma3 -1.207485 0.732721 -1.6479 0.0993636 .
ma4 0.136101 0.473569 0.2874 0.7738104
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above tests, we can say that for ARIMA(6,2,4):
CSS: all the coefficients are statistically significant.
ML: the coefficients of AR(1), AR(2), MA(1), and MA(2) are statistically significant, while the remaining coefficients are not.
So this model is unlikely to be a good fit, as under ML estimation not all of its coefficients are statistically significant.
Above, we analyzed all the candidate models on the basis of their coefficients. We found that the best candidates on this basis are:
ARIMA(0,2,1)
ARIMA(0,2,2)
Now, we will rank the models by their AIC and BIC values to find the best model.
Here, we sort the fitted models by AIC and then by BIC.
# Sorting AIC and BIC values
sort.score <- function(x, score = c("bic", "aic")){
if (score == "aic"){
x[with(x, order(AIC)),]
} else if (score == "bic") {
x[with(x, order(BIC)),]
} else {
warning('score = "x" only accepts valid arguments ("aic","bic")')
}
}
sort.score(AIC(model_221_ML,model_021_ML,model_022_ML,model_122_ML,model_625_ML,
model_624_ML),score="aic")
sort.score(BIC(model_221_ML,model_021_ML,model_022_ML,model_122_ML,model_625_ML,
model_624_ML),score="bic")
After sorting and inspecting the AIC and BIC values of all candidate models, we find that ARIMA(0,2,2) is the best model. Looking back at its coefficient tests, all of its coefficients are statistically significant. Thus, ARIMA(0,2,2) is our best model.
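As an optional cross-check, not part of the original model search, forecast's auto.arima can be asked to select a model for the transformed series with d fixed at 2; this is only a sketch, and its choice may or may not coincide with ARIMA(0,2,2):
# Automatic order selection as a sanity check (d fixed at 2)
auto.arima(BCIncome, d = 2, ic = "aic", trace = TRUE)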
Having found our best model, ARIMA(0,2,2), we will check whether any neighboring model does better (an overfitting check). The possible neighboring models are ARIMA(1,2,2) and ARIMA(0,2,1). As we have already evaluated ARIMA(0,2,1), we check ARIMA(1,2,2) by running the coefficient test on it.
# ARIMA(1,2,2)
model_122_css = arima(BCIncome, order=c(1,2,2),method='CSS')
coeftest(model_122_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.34574 0.56107 -0.6162 0.5377
ma1 -0.40772 0.35905 -1.1355 0.2561
ma2 -0.67414 0.44728 -1.5072 0.1318
model_122_ML = arima(BCIncome, order=c(1,2,2),method='ML')
coeftest(model_122_ML)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.073116 0.421034 -0.1737 0.8621
ma1 -0.574754 0.390344 -1.4724 0.1409
ma2 -0.425237 0.384689 -1.1054 0.2690
From the above tests, we can say that for ARIMA(1,2,2):
CSS: none of the coefficients are statistically significant.
ML: none of the coefficients are statistically significant.
So this model is not a good fit, as none of its coefficients are statistically significant.
Therefore, ARIMA(0,2,2) is our best model.
Now, we will run residual analysis on the best model.
# Residual Analysis
# Residual diagnostics: time series plot, histogram, QQ plot, ACF, PACF,
# Shapiro-Wilk normality test, and Ljung-Box plot of the residuals
residual.analysis <- function(model, std = TRUE){
  if (std == TRUE){
    res.model = rstandard(model)    # standardised residuals
  }else{
    res.model = residuals(model)    # raw residuals
  }
  plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  acf(res.model,main="ACF of standardised residuals")
  pacf(res.model,main="PACF of standardised residuals")
  print(shapiro.test(res.model))    # normality test on the residuals
  k=0
  # Ljung-Box statistics over a range of lags (LBQPlot is from FitAR)
  LBQPlot(res.model, lag.max = length(model$residuals)-1 , StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow=c(1,1))
}
residual.analysis(model = model_022_ML)
Shapiro-Wilk normality test
data: res.model
W = 0.87346, p-value = 0.0001136
From the above plots, we can say that:
In the time series plot, the residuals are much more stable: no trend is visible, there is no seasonality, only a couple of change points, and no intervention, though some points suggest a little remaining autocorrelation. Overall, the plot moves around the central line.
The histogram seems symmetric and roughly normally distributed.
The ACF and PACF plots are within the limits, with no significant lags. Overall they look good.
In the QQ plot, we can see some deviation and a few outliers, and the Shapiro-Wilk test confirms that the residuals are not normally distributed, as its p-value (0.0001) is less than 0.05.
In the Ljung-Box plot, all the points are well above the reference line, so no significant autocorrelation remains (a numeric check is sketched below).
After this residual analysis, we can say the model is a good fit apart from the departure from normality in the residuals. Therefore, we can move forward to predicting income.
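In addition to the LBQPlot panel, the Ljung-Box statistic can be checked numerically with base R's Box.test; a small sketch (lag 10 is an arbitrary choice here, and fitdf = 2 accounts for the two estimated MA coefficients):
# Numeric Ljung-Box test on the residuals of the chosen model
Box.test(residuals(model_022_ML), lag = 10, type = "Ljung-Box", fitdf = 2)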
Now, we will predict Canada's per capita income for the next 5 years, i.e. 2017-2021.
# Forecasting
fit = Arima(TSincome,c(0,2,2), lambda = 0.6)
modelforecast = forecast::forecast(fit,h=5)
modelforecast
plot(modelforecast)
Above, we can see Canada's predicted per capita income for the years 2017-2021, together with 80% and 95% prediction intervals, and a plot of the forecast.
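If the point forecasts and intervals are wanted as a table (for instance, to export), they can be extracted from the forecast object; a small sketch, with forecast_table introduced here for illustration:
# Forecasts and 80%/95% prediction intervals as a data frame
forecast_table = as.data.frame(modelforecast)
forecast_table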
In this project report, we chose a dataset containing Canada's per capita income from 1970 to 2016, with the main aim of predicting income for the following years. To do this, we performed several steps. First, we imported the data and converted it into a time series. We then fitted a linear model, ran diagnostic checks on it, and found that it was not a good fit. Next, we transformed the data with a Box-Cox transformation, since the series was not stationary, but found that it was still non-stationary afterwards, with little visible difference. We then applied differencing and, after second differencing, found the series to be stationary and fit for use. With the help of the ACF, PACF, EACF, and BIC table we identified candidate ARIMA models, performed coefficient tests on them, and, using the AIC and BIC scores, found that ARIMA(0,2,2) was the best available model, with the lowest AIC and BIC values and all coefficients significant. Overfitting checks confirmed that ARIMA(0,2,2) remained the best available model. We then performed residual analysis on the best model and finally predicted Canada's per capita income for the next 5 years, i.e. 2017-2021, visualizing the result.